Add health checks to OIDC discovery provider #3151
Conversation
Thanks for opening this @chdalski! When we discussed this, I assumed this would run on a separate port bound only to localhost. Now that this PR is in front of me, with the …
This was me misunderstanding you then, @azdagron. I would argue against using the /keys endpoint for, at least, the readiness probe, because it would need valid node and workload attestation to be considered "ready", which is IMO not desirable. I will refactor the code to clear up my misunderstanding and use a dedicated listener on localhost.
Signed-off-by: Christoph Dalski <chdalski.coding@gmail.com>
I gave it another try, using a dedicated listener this time. There is one thing I wasn't too sure how to approach, and maybe you can share your thoughts with me.
Thanks! I've added some more feedback. Regarding your question, I think we should disallow binding to anything other than localhost for health checks (which probably means just getting rid of that configurable).
I'm still not quite convinced that the ready/live state determination is correct. As far as I understand it:
- Liveness means "this process should be kept alive", which is that it is either in a good state, or a state that it can recover from.
- Readiness means "this process should be receiving traffic". This should only be the case if the process is in a good state, i.e., that it can respond to the "/keys" and "./well-known/openid-configuration" requests successfully.
Please correct me if my interpretation of liveness and readiness is wrong :)
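For reference, this interpretation matches how Kubernetes wires the two probes to separate endpoints. A sketch of what a probe configuration could look like — the `/live` and `/ready` paths and port `8008` are hypothetical illustrations, not the provider's actual configuration:

```yaml
# Hypothetical kubelet probe config for an oidc-discovery-provider
# container; endpoint paths and port are illustrative assumptions.
livenessProbe:
  httpGet:
    path: /live
    port: 8008
  initialDelaySeconds: 5
  periodSeconds: 10      # "should this process be kept alive?"
readinessProbe:
  httpGet:
    path: /ready
    port: 8008
  periodSeconds: 5       # "should this process receive traffic?"
```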
With the way the provider is currently written, if we cannot fetch the JWKS from the source, we fail the requests to /keys. The current "ready" check is just checking that the process isn't hung. I don't think that is sufficient. To me, a process in this state isn't "ready" since it cannot provide good responses. I'd want a provider in this state to be pulled from the pool until it recovered. If we cached the JWKS retrieved from the source and served that cached copy when we failed to get a fresh one from a source, then I could see us continuing to be "ready" for at least some period of time.
Now, a process in that condition might still be OK to be "live", since it could recover (e.g., SPIRE agent was restarting and the Workload API was temporarily unavailable). How long we let a process like that live is in question; it might never recover without the process or the container being restarted.
The checks implemented in this PR don't really provide anything that we can't get with the "/keys" endpoint.
- The "/keys" endpoint would provide the same behavior as the current "liveness" check, i.e. if the JWKS cannot be retrieved, it fails. Otherwise should return a 200.
- A process that isn't "live" isn't "ready", so there doesn't seem to be a point to configuring a "ready" endpoint.
What do you think of the following proposal:
- We retrieve the JWKS on every "/keys" request. The JWKS is cached. If we cannot retrieve the JWKS, the cached copy is returned. We keep track of the last time we refreshed the cached JWKS. If the cached copy is older than some threshold, we start failing the requests instead of returning the cached copy.
- The process is "ready" if "/keys" is returning a successful response.
- The process is "live" for as long as it is ready OR it isn't ready but hasn't surpassed some threshold where it should "give up".
This means that if the discovery provider boots but can't retrieve the JWKS, it will not be ready (since it shouldn't receive traffic), but will remain "live" for some time. This would give you time to register the workload or do whatever you need to do to allow the discovery provider to retrieve the JWKS.
Once it has the JWKS, it will go ready and start receiving traffic.
If at any point it stops being able to retrieve the JWKS, it will continue to be "live" and "ready" and serve up the cached JWKS. After the readiness threshold has passed, it will stop being "ready" since the JWKS is stale. It will stay "live" and continue attempting to retrieve the JWKS until an internal liveness threshold has passed, at which point it will stop being live and get restarted.
What are your thoughts?
Also, thank you for going through this exercise and for your contributions here. I apologize for not thinking of all this when we were discussing the issue, and am sorry for any wasted effort on the implementation.
Please don't mind it. I should have come better prepared, as I'm no expert in the topic, and I'm glad for the guidance. Your proposal looks well thought through and I like it. After looking at the code, one question came up nonetheless...
The "FetchKeySet" method gives back the JWKs when they were successfully fetched once and stored in the struct. The mechanism you described would need to change that behavior to drop the last fetched version when a refetch fails and serve the cached version instead for the threshold time. This would be a breaking change to the current behavior and it would like to make sure it's valid to do so? |
Thinking about it... a non-breaking alternative to the caching mechanism might be a timestamp-based approach: track how old the current keys are by setting a timestamp to the time of the latest successful fetch.
That sounds like a great plan. 💯 for maintaining compatible behavior.
…dingly Signed-off-by: Christoph Dalski <chdalski.coding@gmail.com>
I updated the code according to our discussion.
Thank you, @chdalski! I think we're getting very close here!
…date documentation Signed-off-by: Christoph Dalski <chdalski.coding@gmail.com>
LastPoll() is supposed to reflect the last successful polling time, but as written now, it reflects the last attempt. Maybe I should have named it LastSuccessfulPoll to be more clear.
Anyway, if we correct LastPoll to return the last successful poll time, then it can be used exclusively to determine health/liveness, since a non-zero poll time implies that we have a JWKS to serve.
I've left some individual comments where I think we need correction.
…y when there is no error while fetching the jwts Signed-off-by: Christoph Dalski <chdalski.coding@gmail.com>
Thank you @azdagron.
Thanks, @chdalski! I think we're good to go here.
Hey @chdalski, looks like there is a misspelling that the linter caught. While you are fixing that, can you also rename …
….com:chdalski/spire into add-health-checks-to-oidc-discovery-provider
@azdagron done. Thank you 👍
Pull Request check list
Affected functionality
Added health check endpoints to the oidc-discovery-provider service.
Description of change
Added readiness and liveness probes as configurable web-endpoints.
Which issue this PR fixes
Implements #3121