POC: Add an HTTP endpoint for liveness checks #2382
Conversation
Instead of halting the server on fixable problems, use a liveness check endpoint that performs the same checks, allowing the Kubernetes system to restart the pod if they fail. Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
s.logger.WithError(errors.WithStack(err)).
    Warnf("A backup storage location named %s has been specified for the server to use by default, but no corresponding backup storage location exists. Backups with a location not matching the default will need to explicitly specify an existing location", s.config.defaultBackupLocation)
}
// Anything that could be created after Velero has started will be
I think it probably makes sense to repeat the checks I deleted here before the liveness endpoint starts, but maybe just log their errors rather than returning them and halting.
mux.HandleFunc("/livez", func(w http.ResponseWriter, r *http.Request) {
    // Refresh the discovery helper, since it may be stale if the velero API group was
    // created after the pod.
    s.discoveryHelper.Refresh()
I'm not sure the `Refresh` call belongs here - it may be better inside of `veleroResourcesExist`.
somewhat related, we already have a goroutine that refreshes discovery every 5min -- so we'll end up refreshing a lot. Not sure yet where it makes sense to have it, just food for thought right now.
Yeah; I added it because it wasn't picking up the CRDs even after I posted them. 5 minutes seems like a long time for a liveness check, but maybe we remove this refresh and change the other one to match whatever the check period is on the pod?
@nrb sure, I'll try this sometime later today and post the update here.
Added an item to the community meeting tomorrow (https://hackmd.io/Jq6F5zqZR7S80CeDWUklkA?both#April-142020) to talk about this.
@carlisia 👍 Please be sure to capture notes on the discussion!
+1 to a liveness probe in general; however, I'm not yet clear on what things we want to check in it -- I'm not sure the current BSL validation that's in there makes sense/solves #1967. Added some more comments at #1967 (comment).
@skriss Agreed, I'll look at those scenarios. For this kind of POC, I just copied existing function calls into the liveness check. For the case where multiple BSLs exist and the default is valid but others aren't, the liveness check should pass. It's very likely that the liveness check we ultimately use will actually build on top of the work @carlisia is doing, as this version is fairly naive and shouldn't be merged as-is.
// Anything that could be created after Velero has started will be
// queried as part of the liveness check
go func() {
    s.logger.Info("Starting liveness server at port 2000")
It would be nice to have the port number "2000" either as a constant or configurable via a variable. Same for line 861.
Failure in `validateBackupStorageLocations` should not necessarily result in a liveness check failure, as long as there is at least one valid BackupStorageLocation.
    return
}

if err := s.validateBackupStorageLocations(); err != nil {
The current logic of `validateBackupStorageLocations` returns an error if there is any invalid BackupStorageLocation. Should Velero continue functioning if there is at least one valid BackupStorageLocation? If so, `validateBackupStorageLocations` should be enhanced to return more than just an error -- for example, the number of valid BackupStorageLocations -- and at that point we can decide that Velero should continue working if some valid BackupStorageLocation is present.
This validation will change after #2617.
We will change the validation logic to only error if there are no valid BackupStorageLocations; otherwise, any invalid BSL will only produce warnings.
I'm going to close this out, given that #2674 merged. I don't think
Instead of halting the server on fixable problems, use a
liveness check endpoint that does the same checks, allowing the
Kubernetes system to restart the pod if they fail.
This change means that if you start a `velero server` locally, with just a namespace and none of the CRDs defined, you can do `curl localhost:2000/livez` and get a 404 indicating there's a problem, which will be in the logs. Once all resources are in acceptable shape, a 200 response will come back, and Velero will resume.
Should all BackupStorageLocations be deleted, or all the CRDs get removed, the liveness check will start returning 404 again.
This will make Velero behave the same whether a BSL exists on startup, or if it gets removed after Velero is already running.
I think this may be a possible fix for #1967. @betta1, could you please take a look and maybe try it out? This change does not yet include a change to the YAML to use the liveness probe.
Fixes #1967
Signed-off-by: Nolan Brubaker brubakern@vmware.com