
Add health checks to OIDC discovery provider #3151

Merged: 10 commits merged into spiffe:main on Jun 22, 2022
Conversation

@chdalski (Contributor) commented Jun 9, 2022

Pull Request check list

  • Commit conforms to CONTRIBUTING.md?
  • Proper tests/regressions included?
  • Documentation updated?

Affected functionality
Added health check endpoints to the oidc-discovery-provider service.

Description of change
Added readiness and liveness probes as configurable web endpoints.

Which issue this PR fixes
Implements #3121

@azdagron (Member) commented Jun 9, 2022

Thanks for opening this @chdalski! When we discussed this, I assumed this would run on a separate port bound only to localhost. Now that this PR is in front of me, with the /live and /ready endpoints being implemented on the same handler that serves the normal content, I wonder why we can't just use the /keys endpoint as both the live and ready check....

@chdalski (Contributor, Author) commented Jun 9, 2022

This was me misunderstanding you then, @azdagron.

I would argue against using the /keys endpoint for at least the readiness probe, because it would need valid node and workload attestation to be considered "ready", which is, in my opinion, not desirable.

I will refactor the code to clean up my misunderstanding and will use a dedicated listener on localhost.

azdagron self-assigned this Jun 9, 2022
@chdalski (Contributor, Author) commented

I gave it another try using a dedicated listener this time.

There is one thing I wasn't too sure about, and maybe you can share your thoughts with me.
The current insecureAddress binding takes just an address, which can be e.g. "localhost:8080", while the healthChecks endpoint is currently (in accordance with the SPIRE server and agent healthChecks endpoints) split into address and port.
If that's not an issue, I would leave it like that. What do you think?
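
(For illustration only: a minimal Go sketch of the two configuration shapes being compared above. The struct, field, and HCL tag names are assumptions for this example, not the PR's actual config.go.)

```go
package main

import "fmt"

// Config contrasts the two hypothetical configuration shapes under discussion.
type Config struct {
	// Existing style: a single "host:port" string, e.g. "localhost:8080".
	InsecureAddr string `hcl:"insecure_addr"`

	// Health-check style, following the SPIRE server/agent convention of
	// splitting the bind address and port into separate settings.
	HealthChecks *HealthChecksConfig `hcl:"health_checks"`
}

type HealthChecksConfig struct {
	BindAddress string `hcl:"bind_address"`
	BindPort    string `hcl:"bind_port"`
	LivePath    string `hcl:"live_path"`
	ReadyPath   string `hcl:"ready_path"`
}

func main() {
	c := Config{
		InsecureAddr: "localhost:8080",
		HealthChecks: &HealthChecksConfig{
			BindAddress: "localhost",
			BindPort:    "8008",
			LivePath:    "/live",
			ReadyPath:   "/ready",
		},
	}
	fmt.Printf("%+v %+v\n", c, *c.HealthChecks)
}
```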

@azdagron (Member) left a review comment

Thanks! I've added some more feedback. Regarding your question, I think we should disallow binding to anything other than localhost for health checks (which probably means just getting rid of that configurable).

I'm still not quite convinced that the ready/live state determination is correct. As far as I understand it:

  • Liveness means "this process should be kept alive", which is that it is either in a good state, or a state that it can recover from.
  • Readiness means "this process should be receiving traffic". This should only be the case if the process is in a good state, i.e., that it can respond to the "/keys" and "/.well-known/openid-configuration" requests successfully.

Please correct me if my interpretation of liveness and readiness is wrong :)

With the way the provider is currently written, if we cannot fetch the JWKS from the source, we fail the requests to /keys. The current "ready" check is just checking that the process isn't hung. I don't think that is sufficient. To me, a process in this state isn't "ready" since it cannot provide good responses. I'd want a provider in this state to be pulled from the pool until it recovered. If we cached the JWKS retrieved from the source and served that cached copy when we failed to get a fresh one from a source, then I could see us continuing to be "ready" for at least some period of time.

Now, a process in that condition might still be OK to be "live", since it could recover (e.g., the SPIRE agent was restarting and the Workload API was temporarily unavailable). How long we let a process like that stay live is the question. It might never recover without being rebooted or the container restarted.

The checks implemented in this PR don't really provide anything that we can't get with the "/keys" endpoint.

  1. The "/keys" endpoint would provide the same behavior as the current "liveness" check, i.e. if the JWKS cannot be retrieved, it fails. Otherwise it should return a 200.
  2. A process that isn't "live" isn't "ready", so there doesn't seem to be a point to configuring a "ready" endpoint.

What do you think of the following proposal:

  1. We retrieve the JWKS on every "/keys" request. The JWKS is cached. If we cannot retrieve the JWKS, the cached copy is returned. We keep track of the last time we refreshed the cached JWKS. If the cached copy is older than some threshold, we start failing the requests instead of returning the cached copy.
  2. The process is "ready" if "/keys" is returning a successful response.
  3. The process is "live" for as long as it is ready OR it isn't ready but hasn't surpassed some threshold where it should "give up".

This means that if the discovery provider boots but can't retrieve the JWKS, it will not be ready (since it shouldn't receive traffic), but will remain "live" for some time. This would give you time to register the workload or do whatever you need to do to allow the discovery provider to retrieve the JWKS.

Once it has the JWKS, it will go ready and start receiving traffic.

If at any point it stops being able to retrieve the JWKS, it will continue to be "live" and "ready" and serve up the cached JWKS. After the readiness threshold has passed, it will stop being "ready" since the JWKS is stale. It will stay "live" and continue attempting to retrieve the JWKS until an internal liveness threshold has passed, at which point it will stop being live and get restarted.

What are your thoughts?
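
(For illustration only: a minimal Go sketch of the readiness/liveness logic proposed above. The type, field, and threshold names are assumptions for this example, not the provider's actual code; it assumes the give-up threshold is at least as large as the readiness threshold.)

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// healthState tracks when the JWKS was last refreshed successfully and
// derives readiness and liveness from how stale that refresh has become.
type healthState struct {
	mu               sync.Mutex
	startTime        time.Time     // when the provider started
	lastSuccess      time.Time     // last successful JWKS refresh (zero if never)
	refreshThreshold time.Duration // readiness: maximum tolerated JWKS staleness
	giveUpThreshold  time.Duration // liveness: staleness at which we "give up"
}

// markSuccess records a successful JWKS refresh.
func (h *healthState) markSuccess(now time.Time) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.lastSuccess = now
}

// ready: we have a JWKS and it is no staler than the readiness threshold.
func (h *healthState) ready(now time.Time) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	return !h.lastSuccess.IsZero() && now.Sub(h.lastSuccess) <= h.refreshThreshold
}

// live: either ready, or not yet past the give-up point. A provider that has
// never fetched a JWKS counts its staleness from startup, so it gets time to
// obtain its first key set before being restarted.
func (h *healthState) live(now time.Time) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	anchor := h.lastSuccess
	if anchor.IsZero() {
		anchor = h.startTime
	}
	return now.Sub(anchor) <= h.giveUpThreshold
}

func main() {
	h := &healthState{
		startTime:        time.Now(),
		refreshThreshold: 5 * time.Minute,
		giveUpThreshold:  30 * time.Minute,
	}
	now := time.Now()
	fmt.Println("ready:", h.ready(now), "live:", h.live(now)) // ready: false live: true

	h.markSuccess(now)
	fmt.Println("ready:", h.ready(now), "live:", h.live(now)) // ready: true live: true
}
```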

Review threads (outdated, resolved) on: support/oidc-discovery-provider/config.go, support/oidc-discovery-provider/main.go
@azdagron (Member) commented

Also, thank you for going through this exercise and for your contributions here. I apologize for not thinking of all this when we were discussing the issue and am sorry for any wasted efforts on the implementation.

@chdalski (Contributor, Author) commented Jun 13, 2022

Please don't worry about it. I should have come better prepared, as I'm no expert in the topic, and I'm glad for the guidance.

Your proposal looks well thought through and I like it.

After looking at the code, one question came up nonetheless...

  1. We retrieve the JWKS on every "/keys" request. The JWKS is cached. If we cannot retrieve the JWKS, the cached copy is returned. We keep track of the last time we refreshed the cached JWKS. If the cached copy is older than some threshold, we start failing the requests instead of returning the cached copy.

The "FetchKeySet" method gives back the JWKs when they were successfully fetched once and stored in the struct.
Currently there is no mechanism to reset the JWKs even if a follow up request to fetch new keys isn't successful anymore.
So whenever "FetchKeySet" is called it will provided you with the last fetched keys if it could at least fetch them once.

The mechanism you described would need to change that behavior to drop the last fetched version when a refetch fails and serve the cached version instead for the threshold time.

This would be a breaking change to the current behavior and it would like to make sure it's valid to do so?

@chdalski (Contributor, Author) commented

Thinking about it... a non-breaking alternative to the caching mechanism might be a timestamp-based approach that tells how old the current keys are, setting the timestamp to the time of the latest successful fetch.
That way we could, for example, define a threshold that determines how old the keys are allowed to be.
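
(For illustration only: a rough Go sketch of that timestamp-based idea. The source keeps returning the last successfully fetched key set, unchanged from today, and additionally records when that fetch happened so a health check can judge staleness. Names and signatures here are assumptions, not the provider's actual source interface.)

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// jwksSource keeps the existing behavior (always return the last key set that
// was fetched successfully) and adds a timestamp of that last successful fetch.
type jwksSource struct {
	mu          sync.Mutex
	keySet      []byte    // stand-in for the cached JWKS document
	lastFetched time.Time // time of the latest successful fetch
}

// setKeySet is called only when a fetch succeeds.
func (s *jwksSource) setKeySet(keySet []byte, fetchedAt time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.keySet = keySet
	s.lastFetched = fetchedAt
}

// FetchKeySet returns the cached keys, the time they were last refreshed, and
// whether any keys are available at all.
func (s *jwksSource) FetchKeySet() ([]byte, time.Time, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.keySet, s.lastFetched, s.keySet != nil
}

func main() {
	src := &jwksSource{}
	src.setKeySet([]byte(`{"keys":[]}`), time.Now().Add(-10*time.Minute))

	maxAge := 5 * time.Minute // hypothetical staleness threshold
	if _, fetchedAt, ok := src.FetchKeySet(); ok {
		fmt.Println("keys too old:", time.Since(fetchedAt) > maxAge) // keys too old: true
	}
}
```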

@azdagron (Member) commented

That sounds like a great plan. 💯 for maintaining compatible behavior.

@chdalski (Contributor, Author) commented

I updated the code according to our discussion.
Please let me know if it's to your liking.

azdagron added this to the 1.3.2 milestone Jun 16, 2022
@azdagron (Member) left a review comment

Thank you, @chdalski! I think we're getting very close here!

Review threads (resolved) on: support/oidc-discovery-provider/healthchecks_handler.go, support/oidc-discovery-provider/README.md
@azdagron (Member) left a review comment

LastPoll() is supposed to reflect the last successful polling time, but as written now, it reflects the last attempt. Maybe I should have named it LastSuccessfulPoll to be clearer.

Anyway, if we correct LastPoll to return the last successful poll time, then that can be used exclusively to determine the health/liveness, since a non-zero poll time implies that we have a JWKS to serve.

I've left some individual comments where I think we need correction.
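
(For illustration only: a small Go sketch of the correction described above, where the recorded poll time is updated only when a poll actually succeeds. The surrounding type and function names are hypothetical, not the PR's server_api.go/workload_api.go code.)

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// pollingSource is a hypothetical stand-in for the provider's JWKS sources.
type pollingSource struct {
	mu                 sync.Mutex
	lastSuccessfulPoll time.Time
}

// pollOnce attempts one refresh. The timestamp is updated only on success, so
// LastSuccessfulPoll reflects the last successful poll, not the last attempt.
func (s *pollingSource) pollOnce(ctx context.Context, fetch func(context.Context) error) error {
	if err := fetch(ctx); err != nil {
		// On failure, leave lastSuccessfulPoll untouched so health checks can
		// see how long it has been since the last good poll.
		return err
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.lastSuccessfulPoll = time.Now()
	return nil
}

// LastSuccessfulPoll returns the time of the last successful poll. A zero
// value means no poll has ever succeeded, i.e. there is no JWKS to serve yet.
func (s *pollingSource) LastSuccessfulPoll() time.Time {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.lastSuccessfulPoll
}

func main() {
	src := &pollingSource{}
	_ = src.pollOnce(context.Background(), func(context.Context) error { return nil })
	fmt.Println("has successful poll:", !src.LastSuccessfulPoll().IsZero()) // true
}
```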

Review threads (outdated, resolved) on: support/oidc-discovery-provider/server_api.go, support/oidc-discovery-provider/workload_api.go, support/oidc-discovery-provider/healthchecks_handler.go
@chdalski (Contributor, Author) commented

Thank you @azdagron.
I implemented the changes as suggested.
Please review it again.

azdagron previously approved these changes Jun 22, 2022

@azdagron (Member) left a review comment

Thanks, @chdalski! I think we're good to go here.

@azdagron (Member) commented

Hey @chdalski, looks like there is a misspelling that the linter caught. While you are fixing that, can you also rename healthCheck_handler_test.go to healthchecks_handler_test.go to be consistent?

@chdalski (Contributor, Author) commented

@azdagron done. Thank you 👍

azdagron merged commit e530498 into spiffe:main Jun 22, 2022
stevend-uber pushed a commit to stevend-uber/spire that referenced this pull request Oct 16, 2023