POC: Add an HTTP endpoint for liveness checks #2382
Conversation
Instead of halting the server on fixable problems, use a liveness check endpoint that performs the same checks, allowing the Kubernetes system to restart the pod if they fail. Signed-off-by: Nolan Brubaker <brubakern@vmware.com>
s.logger.WithError(errors.WithStack(err)).
    Warnf("A backup storage location named %s has been specified for the server to use by default, but no corresponding backup storage location exists. Backups with a location not matching the default will need to explicitly specify an existing location", s.config.defaultBackupLocation)
}
// Anything that could be created after Velero has started will be
I think it probably makes sense to repeat the checks I deleted here before the liveness endpoint starts, but maybe just log their errors rather than returning them and halting.
mux.HandleFunc("/livez", func(w http.ResponseWriter, r *http.Request) {
    // Refresh the discovery helper, since it may be stale if the velero API group was
    // created after the pod.
    s.discoveryHelper.Refresh()
I'm not sure the `Refresh` call belongs here - it may be better inside of `veleroResourcesExist`.
somewhat related, we already have a goroutine that refreshes discovery every 5min -- so we'll end up refreshing a lot. Not sure yet where it makes sense to have it, just food for thought right now.
Yeah; I added it because it wasn't picking up the CRDs even after I posted them. 5 minutes seems like a long time for a liveness check, but maybe we remove this refresh and change the other one to match whatever the check period is on the pod?
@nrb sure, I'll try this sometime later today and post the update here.
Added an item to the community meeting tomorrow (https://hackmd.io/Jq6F5zqZR7S80CeDWUklkA?both#April-142020) to talk about this.
@carlisia 👍 Please be sure to capture notes on the discussion!
+1 to a liveness probe in general; however, I'm not yet clear on what things we want to check in it -- I'm not sure the current BSL validation that's in there makes sense/solves #1967. Added some more comments at #1967 (comment).
@skriss Agreed, I'll look at those scenarios. For this kind of POC, I just copied existing function calls into the liveness check. For the case where multiple BSLs exist and the default is valid but others aren't, the liveness check should pass. It's very likely that the liveness check we ultimately use will actually build on top of the work @carlisia is doing, as this version is fairly naive and shouldn't be merged as-is.
// Anything that could be created after Velero has started will be
// queried as part of the liveness check
go func() {
    s.logger.Info("Starting liveness server at port 2000")
It would be nice to have the port number "2000" either as a constant or configurable via a variable. Same for line 861.
Failure in `validateBackupStorageLocations` should not necessarily result in a liveness check failure, as long as there is at least one valid BackupStorageLocation.
    return
}

if err := s.validateBackupStorageLocations(); err != nil {
The current logic of `validateBackupStorageLocations` returns an error if there is any invalid BackupStorageLocation. Should Velero continue functioning if there is at least one valid BackupStorageLocation? If so, `validateBackupStorageLocations` should be enhanced to return more than just an error -- for example, the number of valid BackupStorageLocations -- and at that point we can decide that Velero should continue working if some valid BackupStorageLocation is present.
This validation will change after #2617.
We will change the validation logic to only error if there are no valid BackupStorageLocations; otherwise, any invalid BSL will only produce warnings.
I'm going to close this out, given that #2674 merged. I don't think
Instead of halting the server on fixable problems, use a
liveness check endpoint that does the same checks, allowing the
Kubernetes system to restart the pod if they fail.
This change means that if you start a `velero server` locally, with just a namespace and none of the CRDs defined, you can do `curl localhost:2000/livez` and get a 404 indicating there's a problem, which will be in the logs. Once all resources are in acceptable shape, a 200 response will come back, and Velero will resume.
Should all BackupStorageLocations be deleted, or all the CRDs get removed, the liveness check will start returning 404 again.
This will make Velero behave the same whether a BSL exists on startup, or if it gets removed after Velero is already running.
I think this may be a possible fix for #1967. @betta1, could you please take a look and maybe try it out? This change does not yet include a change to the YAML to use the liveness probe.
Fixes #1967
Signed-off-by: Nolan Brubaker brubakern@vmware.com