
Record backup start and completion times, add timing metrics #564

Merged · 1 commit · Jun 28, 2018

Conversation

@nrb (Contributor) commented Jun 20, 2018

Signed-off-by: Nolan Brubaker nolan@heptio.com

@nrb self-assigned this Jun 20, 2018

@nrb (Contributor, Author) commented Jun 20, 2018

CI's currently failing due to a known issue in Testify, stemming from time package behavior. In summary, JSON marshalling and unmarshalling of times in Go drops the monotonic clock reading and can alter the time.Location property. This affects Ark because the startTimestamp and completionTimestamp values will be updated via HTTP patch calls with JSON payloads.

We can fix this with the time.Equal function (which the Go docs recommend), but will need to do a bit of extra conversion from the interface{} that ValidatePatch currently compares, and probably write a helper to hook into the *testing.T facilities. We could also use apimachinery's DeepEqual function, though it's using the advised-against == comparison method.

@ncdc @skriss Any preferences on approach here? Is this something we need to be aware of in general for restores, given that all times going through Ark will encounter this issue?

@nrb (Contributor, Author) commented Jun 20, 2018

@ashish-amarnath Since you've been working on metrics code, I wanted to let you know I'm grabbing the backup time/duration ones in this PR.

@ashish-amarnath (Contributor) commented:
@nrb I see you are only capturing the backup duration for now. Feel free to get a hold of me either here or on the #ark-dr slack channel if you need to consult on the metrics package.

@ncdc (Contributor) commented Jun 21, 2018

> We could also use apimachinery's DeepEqual function, though it's using the advised-against == comparison method.

UTC() actually ends up stripping the monotonic clock field, so == works in this case. I vote we go this route for now.

@nrb changed the title from "WIP - Record backup start and completion times, add timing metrics" to "Record backup start and completion times, add timing metrics" on Jun 21, 2018
@@ -31,10 +31,17 @@ const (
backupAttemptCount = "backup_attempt_total"
backupSuccessCount = "backup_success_total"
backupFailureCount = "backup_failure_total"
backupSecondsCount = "backup_seconds_total"
Reviewer comment (Contributor):
Suggest re-naming to this backupDurationSeconds = "backup_duration_seconds".
The convention that I've seen is I count 'events' and then report them as the total number of events, seen in the look back time window.
In this case, what we are capturing is how long it took to run the backup which is not an event. It is kinda similar to the backup size.

Reviewer comment (Contributor):

+1


scheduleLabel = "schedule"
)

// SecondsInMinute returns the number of seconds in a minute.
// This is mostly a helper to create Prometheus histogram buckets cleanly.
func SecondsInMinute(minutes uint64) float64 {
Reviewer comment (Contributor):

Not sure you need this method. Suggest creating a const like secondsInMinute = 60 and

Buckets: []float64{
	1 * secondsInMinute,
	2 * secondsInMinute,
	3 * secondsInMinute,
	4 * secondsInMinute,
	5 * secondsInMinute,
	6 * secondsInMinute,
	7 * secondsInMinute,
	8 * secondsInMinute,
	9 * secondsInMinute,
	10 * secondsInMinute,
},

WDYT?

Author (@nrb):

Good suggestion, I'll go with a float64 constant and use it directly, without a function call.
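A sketch of the constant-based approach being described (names assumed; not necessarily the final PR code):

```go
package main

import "fmt"

// Plain float64 constant used directly, with no helper function.
const secondsInMinute float64 = 60

func main() {
	// Build the 1-10 minute histogram bucket bounds, in seconds.
	buckets := make([]float64, 0, 10)
	for m := 1.0; m <= 10; m++ {
		buckets = append(buckets, m*secondsInMinute)
	}
	fmt.Println(buckets) // [60 120 180 240 300 360 420 480 540 600]
}
```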

prometheus.HistogramOpts{
Namespace: metricNamespace,
Name: backupSecondsCount,
Help: "Total seconds taken for backups",
Reviewer comment (Contributor):

Suggest changing this to

Name: backupDurationSeconds,
Help: "Time taken to complete backup, in seconds",

@@ -231,6 +235,19 @@ func TestProcessBackup(t *testing.T) {
res.Status.Expiration.Time = expiration
res.Status.Phase = v1.BackupPhase(phase)

// We don't care about the value of the timestamps here, just whether or
Reviewer comment (Contributor):

Is it not possible to use the value from the patch?

Author (@nrb):

It is - I can extract it and parse it to make the test more accurate. I was taking a bit of a shortcut here to avoid that work, but I'll update it.

@@ -140,7 +136,7 @@ func (m *ServerMetrics) RegisterBackupFailed(backupSchedule string) {

// RegisterBackupSeconds records the number of seconds a backup took.
func (m *ServerMetrics) RegisterBackupSeconds(backupSchedule string, seconds float64) {
Reviewer comment (Contributor):

Call this RegisterBackupDuration?

@@ -231,6 +235,23 @@ func TestProcessBackup(t *testing.T) {
res.Status.Expiration.Time = expiration
res.Status.Phase = v1.BackupPhase(phase)

// We don't care about the value of the timestamps here, just whether or
// not they're present in the patchMap.
// If there's an error, it's mostly likely that the key wasn't found
Reviewer comment (Contributor):

nit: // If there's an error, it's most likely that the key wasn't found

Author (@nrb):

Fixed

func (m *ServerMetrics) RegisterBackupFailed(backupSchedule string) {
if c, ok := m.metrics[backupFailureCount].(*prometheus.CounterVec); ok {
c.WithLabelValues(backupSchedule).Inc()
}
}

// RegisterBackupSeconds records the number of seconds a backup took.
func (m *ServerMetrics) RegisterBackupSeconds(backupSchedule string, seconds float64) {
Reviewer comment (Contributor):

suggest naming this method RegisterBackupDuration.

Author (@nrb):

Done

@nrb force-pushed the backup-timing branch 2 times, most recently from ebb4686 to bf6e696 on Jun 25, 2018
@nrb mentioned this pull request Jun 25, 2018
@nrb force-pushed the backup-timing branch 2 times, most recently from 28292ce to 9026487 on Jun 25, 2018
@ncdc (Contributor) commented Jun 26, 2018

After reading up a bit more on Prometheus histograms, this PR divides the backup duration into 10 buckets, from 1 to 10 ~~seconds~~ minutes, plus an additional +Inf bucket for anything over 10 ~~seconds~~ minutes. The histogram is essentially a group of counters, and it's cumulative, so if we execute 1 backup and it takes 8 ~~seconds~~ minutes to run, the buckets for 8, 9, 10, and +Inf would all show 1 (remember, it's a count, not an actual duration).

Unfortunately, the buckets have to be hard-coded up front when we create the histogram. I'm wondering if 1-10 ~~seconds~~ minutes is reasonable. I'm thinking we might need a greater range, like 5s, 30s, 1m, 5m, 15m, 30m, etc.

I'm definitely looking for guidance from people more experienced with metrics gathering than I am 😄
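The cumulative-bucket behavior described above can be sketched without Prometheus itself (the 1-10 minute bounds are assumed from the PR):

```go
package main

import "fmt"

func main() {
	// Hypothetical bucket upper bounds, in seconds (1 through 10 minutes).
	bounds := []float64{60, 120, 180, 240, 300, 360, 420, 480, 540, 600}
	counts := make([]int, len(bounds)+1) // final slot is the +Inf bucket

	observe := func(seconds float64) {
		// A Prometheus histogram is cumulative: an observation increments
		// every bucket whose upper bound is >= the observed value.
		for i, ub := range bounds {
			if seconds <= ub {
				counts[i]++
			}
		}
		counts[len(bounds)]++ // +Inf counts every observation
	}

	observe(8 * 60) // one backup that took 8 minutes
	fmt.Println(counts) // [0 0 0 0 0 0 0 1 1 1 1]
}
```

The 8-, 9-, and 10-minute buckets plus +Inf all read 1 for a single 8-minute backup: a count of observations, not a duration.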

d.Println()
// "<n/a>" output should only be applicable for backups that failed validation
if status.StartTimestamp.Time.IsZero() {
d.Printf("Start Time:\t%s\n", "<n/a>")
Reviewer comment (Contributor):

Let's call this Started?

d.Printf("Start Time:\t%s\n", status.StartTimestamp.Time)
}
if status.CompletionTimestamp.Time.IsZero() {
d.Printf("Completion Time:\t%s\n", "<n/a>")
Reviewer comment (Contributor):

Let's call this Completed?

@@ -163,6 +163,18 @@ func DescribeBackupStatus(d *Describer, status v1.BackupStatus) {

d.Println()
d.Printf("Expiration:\t%s\n", status.Expiration.Time)
d.Println()
// "<n/a>" output should only be applicable for backups that failed validation
Reviewer comment (Contributor):

I would put started & completed above expiration

@nrb (Contributor, Author) commented Jun 26, 2018

> this PR divides the backup duration into 10 buckets, from 1 to 10 seconds, plus an additional +Inf bucket for anything over 10 seconds.

The buckets are definitely 1-10 minutes. https://github.com/prometheus/client_golang/blob/master/prometheus/histogram.go#L54 shows the default values, and states that the values are based on seconds.

That said, I'm not confident that this is a good distribution. It may take some testing to tune these.

@ncdc (Contributor) commented Jun 26, 2018

My mistake - 1-10 minutes is right. Thanks for the correction!

7 * secondsInMinute,
8 * secondsInMinute,
9 * secondsInMinute,
10 * secondsInMinute,
Reviewer comment (Contributor):

This will mean any backup longer than 10 minutes is counted as "infinite". Is that reasonable?

It might be useful to use exponential buckets to get more resolution at the higher end of this range.

Reviewer comment (Contributor):

You could also keep this distribution but cleanup the code a little using LinearBuckets like:

    Buckets: prometheus.LinearBuckets(0.0, secondsInMinute, 10),

@stevesloka commented:
The buckets can be tricky to define since they are typically specific to an environment. I'd recommend sane defaults to start, then allow users to define those buckets if they need to outside the defaults.

@nrb (Contributor, Author) commented Jun 26, 2018

@stevesloka How would you recommend letting users define them? For what it's worth, we're considering getting rid of the Ark Config CRD.

@ncdc (Contributor) commented Jun 26, 2018

Anything in config that doesn't move to a backup target will most likely become a flag to ark server

@stevesloka commented:

I'm not sure specifically in Ark, but it should be something on the server, since you can only configure it once.

@skriss (Member) left a review:

not much to add, just one question

@@ -392,6 +393,11 @@ func (controller *backupController) runBackup(backup *api.Backup, bucket string)
backupScheduleName := backup.GetLabels()["ark-schedule"]
controller.metrics.SetBackupTarballSizeBytesGauge(backupScheduleName, backupSizeBytes)

backup.Status.CompletionTimestamp.Time = controller.clock.Now()
backupDuration := backup.Status.CompletionTimestamp.Time.Sub(backup.Status.StartTimestamp.Time)
backupDurationSeconds := float64(backupDuration / time.Second)
Reviewer comment (Member):

why's the / time.Second needed? (Not saying it's wrong, just wondering)

Author (@nrb):

Good catch - that should be multiplied to get the seconds, since Duration is nanoseconds.

Author (@nrb):

Actually I'm confusing myself here - this is correct. We need to divide to get the seconds out of a Duration.

Multiplication would be seconds to nanoseconds.

Author (@nrb):

To elaborate a bit more: the histogram we're building uses seconds as its unit, which is why we're going from ns to seconds.

Reviewer comment (Member):

got it, ok. duration math is weird :)

Name: backupDurationSeconds,
Help: "Time taken to complete backup, in seconds",
// Use 10 exponential buckets starting at 30s
Buckets: prometheus.ExponentialBuckets(30.0, 3, 10),
Reviewer comment (Contributor):

This would create the following buckets, correct?

  1. 30s
  2. 1m30s
  3. 4m30s
  4. 13m30s
  5. 40m30s
  6. 2h1m30s
  7. 6h4m30s
  8. 18h13m30s
  9. 54h40m30s
  10. 164h1m30s

Author (@nrb):

Yeah - I think 9 & 10 might be too high, perhaps even 8.

I think Linear buckets of 1 minute may leave out the high end if we only have 10, but I don't know if there's any sort of penalty for making many buckets that go unused.
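For reference, the table above can be reproduced by hand: prometheus.ExponentialBuckets(start, factor, count) returns count bucket bounds of the form start * factor^i. This sketch computes the same values without importing the library:

```go
package main

import "fmt"

func main() {
	// Equivalent of ExponentialBuckets(30.0, 3, 10): start * factor^i.
	start, factor, count := 30.0, 3.0, 10
	buckets := make([]float64, count)
	v := start
	for i := range buckets {
		buckets[i] = v
		v *= factor
	}
	fmt.Println(buckets) // [30 90 270 810 2430 7290 21870 65610 196830 590490]
}
```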

err := equality.Semantic.AddFunc(TimesAreEqual)
if err != nil {
// Programmer error, the service should die.
panic(errors.Wrap(err, "Could not register equality function"))
Reviewer comment (Contributor):

I think t.Fatal* would be more appropriate?

Author (@nrb):

Yeah, I'll fix that.

Signed-off-by: Nolan Brubaker <nolan@heptio.com>
@ncdc (Contributor) commented Jun 28, 2018

LGTM

@ncdc merged commit 539de6d into vmware-tanzu:master Jun 28, 2018
@ncdc added this to the v0.9.0 milestone Jun 28, 2018
@ncdc added the Enhancement/User (End-User Enhancement to Velero) label Jun 28, 2018