
Record backup start and completion times, add timing metrics #564

Merged · 1 commit · Jun 28, 2018

Conversation

@nrb (Contributor) commented Jun 20, 2018

Signed-off-by: Nolan Brubaker nolan@heptio.com

@nrb self-assigned this Jun 20, 2018

@nrb (Contributor, Author) commented Jun 20, 2018

CI's currently failing due to a known issue in Testify, stemming from time package behavior. In summary, JSON marshalling and unmarshalling of times in Go drops the monotonic clock reading and can alter the time.Location property. This affects Ark because the startTimestamp and completionTimestamp values will be updated via HTTP patch calls with JSON payloads.

We can fix this with the time.Equal function (which the Go docs recommend), but will need to do a bit of extra conversion from the interface{} that ValidatePatch currently compares, and probably write a helper to hook into the *testing.T facilities. We could also use apimachinery's DeepEqual function, though it's using the advised-against == comparison method.

@ncdc @skriss Any preferences on approach here? Is this something we need to be aware of in general for restores, given that all times going through Ark will encounter this issue?

@nrb (Contributor, Author) commented Jun 20, 2018

@ashish-amarnath Since you've been working on metrics code, I wanted to let you know I'm grabbing the backup time/duration ones in this PR.

@ashish-amarnath (Contributor) commented:
@nrb I see you are only capturing the backup duration for now. Feel free to get a hold of me either here or on the #ark-dr slack channel if you need to consult on the metrics package.

@ncdc (Contributor) commented Jun 21, 2018

> We could also use apimachinery's DeepEqual function, though it's using the advised-against == comparison method.

UTC() actually ends up stripping the monotonic clock field, so == works in this case. I vote we go this route for now.

@nrb changed the title from "WIP - Record backup start and completion times, add timing metrics" to "Record backup start and completion times, add timing metrics" on Jun 21, 2018
@@ -31,10 +31,17 @@ const (
backupAttemptCount = "backup_attempt_total"
backupSuccessCount = "backup_success_total"
backupFailureCount = "backup_failure_total"
backupSecondsCount = "backup_seconds_total"
Reviewer comment (Contributor):
Suggest re-naming to this backupDurationSeconds = "backup_duration_seconds".
The convention that I've seen is I count 'events' and then report them as the total number of events, seen in the look back time window.
In this case, what we are capturing is how long it took to run the backup which is not an event. It is kinda similar to the backup size.

Reviewer comment (Contributor):

+1


scheduleLabel = "schedule"
)

// SecondsInMinute returns the number of seconds in a minute.
// This is mostly a helper to create Prometheus histogram buckets cleanly.
func SecondsInMinute(minutes uint64) float64 {
Reviewer comment (Contributor):

Not sure you need this method. Suggest creating a const like secondsInMinute = 60 and

Buckets: []float64{
	1 * secondsInMinute,
	2 * secondsInMinute,
	3 * secondsInMinute,
	4 * secondsInMinute,
	5 * secondsInMinute,
	6 * secondsInMinute,
	7 * secondsInMinute,
	8 * secondsInMinute,
	9 * secondsInMinute,
	10 * secondsInMinute,
},

WDYT?

Author (@nrb):

Good suggestion, I'll go with a float64 constant and use it directly, without a function call.
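A sketch of the constant-based approach being described (names assumed; not necessarily the final PR code):

```go
package main

import "fmt"

// Plain float64 constant used directly, with no helper function.
const secondsInMinute float64 = 60

func main() {
	// Build the 1-10 minute histogram bucket bounds, in seconds.
	buckets := make([]float64, 0, 10)
	for m := 1.0; m <= 10; m++ {
		buckets = append(buckets, m*secondsInMinute)
	}
	fmt.Println(buckets) // [60 120 180 240 300 360 420 480 540 600]
}
```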

prometheus.HistogramOpts{
Namespace: metricNamespace,
Name: backupSecondsCount,
Help: "Total seconds taken for backups",
Reviewer comment (Contributor):

Suggest changing this to

Name: backupDurationSeconds,
Help: "Time taken to complete backup, in seconds",

@@ -231,6 +235,19 @@ func TestProcessBackup(t *testing.T) {
res.Status.Expiration.Time = expiration
res.Status.Phase = v1.BackupPhase(phase)

// We don't care about the value of the timestamps here, just whether or
Reviewer comment (Contributor):

Is it not possible to use the value from the patch?

Author (@nrb):

It is - I can extract it and parse it to make the test more accurate. I was taking a bit of a shortcut here to avoid that work, but I'll update it.

@@ -140,7 +136,7 @@ func (m *ServerMetrics) RegisterBackupFailed(backupSchedule string) {

// RegisterBackupSeconds records the number of seconds a backup took.
func (m *ServerMetrics) RegisterBackupSeconds(backupSchedule string, seconds float64) {
Reviewer comment (Contributor):

Call this RegisterBackupDuration?

@@ -231,6 +235,23 @@ func TestProcessBackup(t *testing.T) {
res.Status.Expiration.Time = expiration
res.Status.Phase = v1.BackupPhase(phase)

// We don't care about the value of the timestamps here, just whether or
// not they're present in the patchMap.
// If there's an error, it's mostly likely that the key wasn't found
Reviewer comment (Contributor):

nit: // If there's an error, it's most likely that the key wasn't found

Author (@nrb):

Fixed

func (m *ServerMetrics) RegisterBackupFailed(backupSchedule string) {
if c, ok := m.metrics[backupFailureCount].(*prometheus.CounterVec); ok {
c.WithLabelValues(backupSchedule).Inc()
}
}

// RegisterBackupSeconds records the number of seconds a backup took.
func (m *ServerMetrics) RegisterBackupSeconds(backupSchedule string, seconds float64) {
Reviewer comment (Contributor):

suggest naming this method RegisterBackupDuration.

Author (@nrb):

Done

@nrb force-pushed the backup-timing branch 2 times, most recently from ebb4686 to bf6e696 on Jun 25, 2018
@nrb mentioned this pull request Jun 25, 2018
@nrb force-pushed the backup-timing branch 2 times, most recently from 28292ce to 9026487 on Jun 25, 2018
@ncdc (Contributor) commented Jun 26, 2018

After reading up a bit more on Prometheus histograms, this PR divides the backup duration into 10 buckets, from 1 to 10 ~~seconds~~ minutes, plus an additional +Inf bucket for anything over 10 ~~seconds~~ minutes. The histogram is essentially a group of counters, and it's cumulative, so if we execute 1 backup and it takes 8 ~~seconds~~ minutes to run, the buckets for 8, 9, 10, and +Inf would all show 1 (remember, it's a count, not an actual duration).

Unfortunately, the buckets have to be hard-coded up front when we create the histogram. I'm wondering if 1-10 ~~seconds~~ minutes is reasonable. I'm thinking we might need a greater range, like 5s, 30s, 1m, 5m, 15m, 30m, etc.

I'm definitely looking for guidance from people more experienced with metrics gathering than I am 😄
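The cumulative-bucket behavior described above can be sketched without Prometheus itself (the 1-10 minute bounds are assumed from the PR):

```go
package main

import "fmt"

func main() {
	// Hypothetical bucket upper bounds, in seconds (1 through 10 minutes).
	bounds := []float64{60, 120, 180, 240, 300, 360, 420, 480, 540, 600}
	counts := make([]int, len(bounds)+1) // final slot is the +Inf bucket

	observe := func(seconds float64) {
		// A Prometheus histogram is cumulative: an observation increments
		// every bucket whose upper bound is >= the observed value.
		for i, ub := range bounds {
			if seconds <= ub {
				counts[i]++
			}
		}
		counts[len(bounds)]++ // +Inf counts every observation
	}

	observe(8 * 60) // one backup that took 8 minutes
	fmt.Println(counts) // [0 0 0 0 0 0 0 1 1 1 1]
}
```

The 8-, 9-, and 10-minute buckets plus +Inf all read 1 for a single 8-minute backup: a count of observations, not a duration.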

d.Println()
// "<n/a>" output should only be applicable for backups that failed validation
if status.StartTimestamp.Time.IsZero() {
d.Printf("Start Time:\t%s\n", "<n/a>")
Reviewer comment (Contributor):

Let's call this Started?

d.Printf("Start Time:\t%s\n", status.StartTimestamp.Time)
}
if status.CompletionTimestamp.Time.IsZero() {
d.Printf("Completion Time:\t%s\n", "<n/a>")
Reviewer comment (Contributor):

Let's call this Completed?

@@ -163,6 +163,18 @@ func DescribeBackupStatus(d *Describer, status v1.BackupStatus) {

d.Println()
d.Printf("Expiration:\t%s\n", status.Expiration.Time)
d.Println()
// "<n/a>" output should only be applicable for backups that failed validation
Reviewer comment (Contributor):

I would put started & completed above expiration

@nrb (Contributor, Author) commented Jun 26, 2018

> this PR divides the backup duration into 10 buckets, from 1 to 10 seconds, plus an additional +Inf bucket for anything over 10 seconds.

The buckets are definitely 1-10 minutes. https://github.com/prometheus/client_golang/blob/master/prometheus/histogram.go#L54 shows the default values, and states that the values are based on seconds.

That said, I'm not confident that this is a good distribution. It may take some testing to tune these.

@ncdc (Contributor) commented Jun 26, 2018

My mistake - 1-10 minutes is right. Thanks for the correction!

7 * secondsInMinute,
8 * secondsInMinute,
9 * secondsInMinute,
10 * secondsInMinute,
Reviewer comment (Contributor):

This will mean any backup longer than 10 minutes is counted as "infinite". Is that reasonable?

It might be useful to use exponential buckets to get more resolution at the higher end of this range.

Reviewer comment (Contributor):

You could also keep this distribution but cleanup the code a little using LinearBuckets like:

    Buckets: prometheus.LinearBuckets(0.0, secondsInMinute, 10),

@stevesloka commented:
The buckets can be tricky to define since they are typically specific to an environment. I'd recommend sane defaults to start, then allow users to define those buckets if they need to outside the defaults.

@nrb (Contributor, Author) commented Jun 26, 2018

@stevesloka How would you recommend letting users define them? For what it's worth, we're considering getting rid of the Ark Config CRD.

@ncdc (Contributor) commented Jun 26, 2018

Anything in config that doesn't move to a backup target will most likely become a flag to ark server

@stevesloka commented:

I'm not sure specifically in Ark, but it should be something on the server, since you can only configure it once.

@skriss (Member) left a review:

not much to add, just one question

@@ -392,6 +393,11 @@ func (controller *backupController) runBackup(backup *api.Backup, bucket string)
backupScheduleName := backup.GetLabels()["ark-schedule"]
controller.metrics.SetBackupTarballSizeBytesGauge(backupScheduleName, backupSizeBytes)

backup.Status.CompletionTimestamp.Time = controller.clock.Now()
backupDuration := backup.Status.CompletionTimestamp.Time.Sub(backup.Status.StartTimestamp.Time)
backupDurationSeconds := float64(backupDuration / time.Second)
Reviewer comment (Member):

why's the / time.Second needed? (Not saying it's wrong, just wondering)

Author (@nrb):

Good catch - that should be multiplied to get the seconds, since Duration is nanoseconds.

Author (@nrb):

Actually I'm confusing myself here - this is correct. We need to divide to get the seconds out of a Duration.

Multiplication would be seconds to nanoseconds.

Author (@nrb):

To elaborate a bit more: the histogram we're building uses seconds as its unit, which is why we're going from ns to seconds.

Reviewer comment (Member):

got it, ok. duration math is weird :)

Name: backupDurationSeconds,
Help: "Time taken to complete backup, in seconds",
// Use 10 exponential buckets starting at 30s
Buckets: prometheus.ExponentialBuckets(30.0, 3, 10),
Reviewer comment (Contributor):

This would create the following buckets, correct?

  1. 30s
  2. 1m30s
  3. 4m30s
  4. 13m30s
  5. 40m30s
  6. 2h1m30s
  7. 6h4m30s
  8. 18h13m30s
  9. 54h40m30s
  10. 164h1m30s

Author (@nrb):

Yeah - I think 9 & 10 might be too high, perhaps even 8.

I think Linear buckets of 1 minute may leave out the high end if we only have 10, but I don't know if there's any sort of penalty for making many buckets that go unused.
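For reference, the table above can be reproduced by hand: prometheus.ExponentialBuckets(start, factor, count) returns count bucket bounds of the form start * factor^i. This sketch computes the same values without importing the library:

```go
package main

import "fmt"

func main() {
	// Equivalent of ExponentialBuckets(30.0, 3, 10): start * factor^i.
	start, factor, count := 30.0, 3.0, 10
	buckets := make([]float64, count)
	v := start
	for i := range buckets {
		buckets[i] = v
		v *= factor
	}
	fmt.Println(buckets) // [30 90 270 810 2430 7290 21870 65610 196830 590490]
}
```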

err := equality.Semantic.AddFunc(TimesAreEqual)
if err != nil {
// Programmer error, the service should die.
panic(errors.Wrap(err, "Could not register equality function"))
Reviewer comment (Contributor):

I think t.Fatal* would be more appropriate?

Author (@nrb):

Yeah, I'll fix that.

Signed-off-by: Nolan Brubaker <nolan@heptio.com>
@ncdc (Contributor) commented Jun 28, 2018

LGTM

@ncdc merged commit 539de6d into vmware-tanzu:master Jun 28, 2018
@ncdc added this to the v0.9.0 milestone Jun 28, 2018
@ncdc added the Enhancement/User (End-User Enhancement to Velero) label Jun 28, 2018