Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC: Add an HTTP endpoint for liveness checks #2382

Closed
wants to merge 1 commit into from

Conversation

nrb
Copy link
Contributor

@nrb nrb commented Apr 1, 2020

Instead of halting the server on fixable problems, instead use a
liveness check endpoint that does the same checks, and will allow the
Kubernetes system to restart the pod if they fail.

This change means that if you start a velero server locally, with just a namespace and none of the CRDs defined, you can do curl localhost:2000/livez and get a 404 indicating there's a problem, which will be in the logs.

Once all resources are in acceptable shape, a 200 page response will come back, and Velero will resume.

Should all BackupStorageLocations be deleted, or all the CRDs get removed, the liveness check will start returning 404 again.

This will make Velero behave the same whether a BSL exists on startup, or if it gets removed after Velero is already running.

I think this may be a possible fix for #1967. @betta1, could you please take a look and maybe try it out? This change does not yet include a change to the YAML to use the liveness probe.

Fixes #1967

Signed-off-by: Nolan Brubaker brubakern@vmware.com

Instead of halting the server on fixable problems, instead use a
liveness check endpoint that does the same checks, and will allow the
Kubernetes system to restart the pod if they fail.

Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
@nrb
Copy link
Contributor Author

nrb commented Apr 1, 2020

@dymurray I see you also had activity on #1967 recently

s.logger.WithError(errors.WithStack(err)).
Warnf("A backup storage location named %s has been specified for the server to use by default, but no corresponding backup storage location exists. Backups with a location not matching the default will need to explicitly specify an existing location", s.config.defaultBackupLocation)
}
// Anything that could be created after Velero has started will be
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it probably makes sense to repeat the checks I deleted here before the liveness endpoint starts, but maybe just log their errors rather than returning them and halting.

mux.HandleFunc("/livez", func(w http.ResponseWriter, r *http.Request) {
// Refresh the discovery helper, since it may be stale if the velero API group was
// created after the pod.
s.discoveryHelper.Refresh()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the Refresh call belongs here - it may be better inside of veleroResourcesExist.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

somewhat related, we already have a goroutine that refreshes discovery every 5min -- so we'll end up refrehsing a lot. Not sure yet where it makes sense to have it, just food for thought right now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah; I added it because it wasn't picking up the CRDs even after I posted them. 5 minutes seems like a long time for a liveness check, but maybe we remove this refresh and change the other one to match whatever the check period is on the pod?

@nrb nrb mentioned this pull request Apr 1, 2020
7 tasks
@betta1
Copy link
Contributor

betta1 commented Apr 1, 2020

@nrb sure, I'll try this sometime later today and post the update here.

@carlisia
Copy link
Contributor

Added an item to the community meeting tomorrow (https://hackmd.io/Jq6F5zqZR7S80CeDWUklkA?both#April-142020) to talk about this.

@nrb
Copy link
Contributor Author

nrb commented Apr 13, 2020

@carlisia 👍 Please be sure to capture notes on the discussion!

@skriss
Copy link
Member

skriss commented Apr 14, 2020

+1 to a liveness probe in general, however I'm not yet clear on what things we want to check in it -- I'm not sure the current BSL validation that's in there makes sense/solves #1967. Added some more comments at #1967 (comment).

@nrb
Copy link
Contributor Author

nrb commented Apr 14, 2020

@skriss Agreed, I'll look at those scenarios. For this kind of POC, I just copied existing function calls into the liveness check. For the case where multiple BSLs exist, the default is valid and others aren't, the liveness check should pass.

It's very likely that the liveness check we ultimately use will actually build on top of the work @carlisia is doing, as this version is fairly naive and shouldn't be merged as-is.

@nrb nrb changed the title Add an HTTP endpoint for liveness checks POC: Add an HTTP endpoint for liveness checks Apr 14, 2020
@nrb nrb self-assigned this Jun 11, 2020
// Anything that could be created after Velero has started will be
// queried as part of the liveness check
go func() {
s.logger.Info("Starting liveness server at port 2000")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have the port number "2000" either as a constant or configurable via variable. Same for line 861.

Copy link
Contributor

@phuongatemc phuongatemc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failure in validateBackupStorageLocations should not necessary result failure in liveness check as long as there is at least 1 valid BackupStorageLocation.

return
}

if err := s.validateBackupStorageLocations(); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current logic of validateBackupStorageLocations would return error if there is any invalid BackupStorageLocation. Should Velero continue function if there is at least 1 valid BackupStorageLocation? If so, validateBackupStorageLocations should be enhanced to return more than just an error. For example return number of valid BackupStorageLocation and at this point we decide Velero should continue work if some valid BackupStorageLocation present.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This validation will change after #2617.

We will change the validation logic to only error if there are no valid BackupStorageLocations. Else it will be only warnings if there's any invalid BSL.

@nrb
Copy link
Contributor Author

nrb commented Jul 14, 2020

I'm going to close this out, given #2674 merged. I don't think CrashLooping the Velero server accomplishes the goal, nor is the liveness endpoint the right place to expose this information at this point in time.

@nrb nrb closed this Jul 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a BSL controller to handle validation + update BSL status phase
5 participants