Add health checks to OIDC discovery provider #3151
Conversation
Thanks for opening this @chdalski! When we discussed this, I assumed this would run on a separate port bound only to localhost. Now that this PR is in front of me, with the …
This was me misunderstanding you then, @azdagron. I would argue against using the /keys endpoint for, at least, the readiness probe, because it would need valid node and workload attestation to be considered "ready", which is IMO not desirable. I will refactor the code to clear up my misunderstanding and use a dedicated listener on localhost.
Signed-off-by: Christoph Dalski <chdalski.coding@gmail.com>
I gave it another try, using a dedicated listener this time. There is one thing I wasn't too sure how to approach, and maybe you can share your thoughts with me.
Thanks! I've added some more feedback. Regarding your question, I think we should disallow binding to anything other than localhost for health checks (which probably means just getting rid of that configurable).
I'm still not quite convinced that the ready/live state determination is correct. As far as I understand it:
- Liveness means "this process should be kept alive", which is that it is either in a good state, or a state that it can recover from.
- Readiness means "this process should be receiving traffic". This should only be the case if the process is in a good state, i.e., that it can respond to the "/keys" and "./well-known/openid-configuration" requests successfully.
Please correct me if my interpretation of liveness and readiness is wrong :)
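For reference, this interpretation matches how Kubernetes wires the two probes to separate endpoints. A sketch of what a probe configuration could look like — the `/live` and `/ready` paths and port `8008` are hypothetical illustrations, not the provider's actual configuration:

```yaml
# Hypothetical kubelet probe config for an oidc-discovery-provider
# container; endpoint paths and port are illustrative assumptions.
livenessProbe:
  httpGet:
    path: /live
    port: 8008
  initialDelaySeconds: 5
  periodSeconds: 10      # "should this process be kept alive?"
readinessProbe:
  httpGet:
    path: /ready
    port: 8008
  periodSeconds: 5       # "should this process receive traffic?"
```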
With the way the provider is currently written, if we cannot fetch the JWKS from the source, we fail the requests to /keys. The current "ready" check is just checking that the process isn't hung. I don't think that is sufficient. To me, a process in this state isn't "ready" since it cannot provide good responses. I'd want a provider in this state to be pulled from the pool until it recovered. If we cached the JWKS retrieved from the source and served that cached copy when we failed to get a fresh one from a source, then I could see us continuing to be "ready" for at least some period of time.
Now, a process in that condition might still be OK to be "live", since it could recover (e.g., SPIRE agent was restarting and the Workload API was temporarily unavailable). How long we let a process like that live is in question; it might never recover without the process or the container being restarted.
The checks implemented in this PR don't really provide anything that we can't get with the "/keys" endpoint.
- The "/keys" endpoint would provide the same behavior as the current "liveness" check, i.e. if the JWKS cannot be retrieved, it fails. Otherwise should return a 200.
- A process that isn't "live" isn't "ready", so there doesn't seem to be a point to configuring a "ready" endpoint.
What do you think of the following proposal:
- We retrieve the JWKS on every "/keys" request. The JWKS is cached. If we cannot retrieve the JWKS, the cached copy is returned. We keep track of the last time we refreshed the cached JWKS. If the cached copy is older than some threshold, we start failing the requests instead of returning the cached copy.
- The process is "ready" if "/keys" is returning a successful response.
- The process is "live" for as long as it is ready OR it isn't ready but hasn't surpassed some threshold where it should "give up".
This means that if the discovery provider boots but can't retrieve the JWKS, it will not be ready (since it shouldn't receive traffic), but will remain "live" for some time. This would give you time to register the workload or do whatever you need to do to allow the discovery provider to retrieve the JWKS.
Once it has the JWKS, it will go ready and start receiving traffic.
If at any point it stops being able to retrieve the JWKS, it will continue to be "live" and "ready" and serve up the cached JWKS. After the readiness threshold has passed, it will stop being "ready" since the JWKS is stale. It will stay "live" and continue attempting to retrieve the JWKS until an internal liveness threshold has passed, at which point it will stop being live and get restarted.
What are your thoughts?
Also, thank you for going through this exercise and for your contributions here. I apologize for not thinking of all this when we were discussing the issue, and am sorry for any wasted effort on the implementation.
Please don't mind it. I should have come better prepared, as I'm no expert in the topic, and I'm glad for the guidance. Your proposal looks well thought through and I like it. After looking at the code, one question came up nonetheless...
The "FetchKeySet" method gives back the JWKs when they were successfully fetched once and stored in the struct. The mechanism you described would need to change that behavior to drop the last fetched version when a refetch fails and serve the cached version instead for the threshold time. This would be a breaking change to the current behavior and it would like to make sure it's valid to do so? |
Thinking about it... a non-breaking alternative to the caching mechanism might be a timestamp-based approach: track how old the current keys are by setting a timestamp to the time of the latest successful fetch.
That sounds like a great plan. 💯 for maintaining compatible behavior.
…dingly Signed-off-by: Christoph Dalski <chdalski.coding@gmail.com>
I updated the code according to our discussion.
Thank you, @chdalski! I think we're getting very close here!
…date documentation Signed-off-by: Christoph Dalski <chdalski.coding@gmail.com>
LastPoll() is supposed to reflect the last successful polling time, but as written now, it reflects the last attempt. Maybe I should have named it LastSuccessfulPoll to be more clear.
Anyway, if we correct LastPoll to return the last successful poll time, then it can be used exclusively to determine health/liveness, since a non-zero poll time implies that we have a JWKS to serve.
I've left some individual comments where I think we need correction.
…y when there is no error while fetching the jwts Signed-off-by: Christoph Dalski <chdalski.coding@gmail.com>
Thank you @azdagron.
Thanks, @chdalski! I think we're good to go here.
Hey @chdalski, looks like there is a misspelling that the linter caught. While you are fixing that, can you also rename …
….com:chdalski/spire into add-health-checks-to-oidc-discovery-provider
@azdagron done. Thank you 👍
Pull Request check list
Affected functionality
Added health check endpoints to the oidc-discovery-provider service.
Description of change
Added readiness and liveness probes as configurable web-endpoints.
Which issue this PR fixes
Implements #3121