-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Record backup start and completion times, add timing metrics #564
Conversation
CI's currently failing due to a known issue in Testify, stemming from a time package behavior. In summary, JSON marshalling and unmarshalling times in Go drop the monotonic clock, and alter the We can fix this with the @ncdc @skriss Any preferences on approach here? Is this something we need to be aware of in general for restores, given that all times going through Ark will encounter this issue? |
@ashish-amarnath Since you've been working on metrics code, I wanted to let you know I'm grabbing the backup time/duration ones in this PR. |
@nrb I see you are only capturing the backup duration for now. Feel free to get a hold of me either here or on the #ark-dr slack channel if you need to consult on the metrics package. |
|
pkg/metrics/metrics.go
Outdated
@@ -31,10 +31,17 @@ const ( | |||
backupAttemptCount = "backup_attempt_total" | |||
backupSuccessCount = "backup_success_total" | |||
backupFailureCount = "backup_failure_total" | |||
backupSecondsCount = "backup_seconds_total" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest re-naming to this backupDurationSeconds = "backup_duration_seconds"
.
The convention that I've seen is I count 'events' and then report them as the total number of events, seen in the look back time window.
In this case, what we are capturing is how long it took to run the backup which is not an event. It is kinda similar to the backup size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
pkg/metrics/metrics.go
Outdated
|
||
scheduleLabel = "schedule" | ||
) | ||
|
||
// SecondsInMinute returns the number of seconds in a minute. | ||
// This is mostly a helper to create Prometheus histogram buckets cleanly. | ||
func SecondsInMinute(minutes uint64) float64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure you need this method. Suggest creating a const like secondsInMinute = 60
and
Buckets: []float64{
SecondsInMinute(1 * secondsInMinute),
SecondsInMinute(2 * secondsInMinute),
SecondsInMinute(3 * secondsInMinute),
SecondsInMinute(4 * secondsInMinute),
SecondsInMinute(5 * secondsInMinute),
SecondsInMinute(6 * secondsInMinute),
SecondsInMinute(7 * secondsInMinute),
SecondsInMinute(8 * secondsInMinute),
SecondsInMinute(9 * secondsInMinute),
SecondsInMinute(10 * secondsInMinute),
},
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion, I'll go with a float64
constant and use it directly, without a function call.
pkg/metrics/metrics.go
Outdated
prometheus.HistogramOpts{ | ||
Namespace: metricNamespace, | ||
Name: backupSecondsCount, | ||
Help: "Total seconds taken for backups", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest changing this to
Name: backupDurationSeconds,
Help: "Time taken to complete backup, in seconds",
@@ -231,6 +235,19 @@ func TestProcessBackup(t *testing.T) { | |||
res.Status.Expiration.Time = expiration | |||
res.Status.Phase = v1.BackupPhase(phase) | |||
|
|||
// We don't care about the value of the timestamps here, just whether or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it not possible to use the value from the patch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is - I can extract it and parse it to make the test more accurate. I was taking a bit of a shortcut here to avoid that work, but I'll update it.
pkg/metrics/metrics.go
Outdated
@@ -140,7 +136,7 @@ func (m *ServerMetrics) RegisterBackupFailed(backupSchedule string) { | |||
|
|||
// RegisterBackupSeconds records the number of seconds a backup took. | |||
func (m *ServerMetrics) RegisterBackupSeconds(backupSchedule string, seconds float64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call this RegisterBackupDuration?
@@ -231,6 +235,23 @@ func TestProcessBackup(t *testing.T) { | |||
res.Status.Expiration.Time = expiration | |||
res.Status.Phase = v1.BackupPhase(phase) | |||
|
|||
// We don't care about the value of the timestamps here, just whether or | |||
// not they're present in the patchMap. | |||
// If there's an error, it's mostly likely that the key wasn't found |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: // If there's an error, it's most likely that the key wasn't found
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
pkg/metrics/metrics.go
Outdated
func (m *ServerMetrics) RegisterBackupFailed(backupSchedule string) { | ||
if c, ok := m.metrics[backupFailureCount].(*prometheus.CounterVec); ok { | ||
c.WithLabelValues(backupSchedule).Inc() | ||
} | ||
} | ||
|
||
// RegisterBackupSeconds records the number of seconds a backup took. | ||
func (m *ServerMetrics) RegisterBackupSeconds(backupSchedule string, seconds float64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest naming this method RegisterBackupDuration
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
ebb4686
to
bf6e696
Compare
28292ce
to
9026487
Compare
After reading up a bit more on Prometheus histograms, this PR divides the backup duration into 10 buckets, from 1 to 10 Unfortunately, the buckets have to be hard-coded up front when we create the histogram. I'm wondering if 1-10 I'm definitely looking for guidance from people more experienced with metrics gathering than I am 😄 |
d.Println() | ||
// "<n/a>" output should only be applicable for backups that failed validation | ||
if status.StartTimestamp.Time.IsZero() { | ||
d.Printf("Start Time:\t%s\n", "<n/a>") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's call this Started
?
d.Printf("Start Time:\t%s\n", status.StartTimestamp.Time) | ||
} | ||
if status.CompletionTimestamp.Time.IsZero() { | ||
d.Printf("Completion Time:\t%s\n", "<n/a>") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's call this Completed
?
@@ -163,6 +163,18 @@ func DescribeBackupStatus(d *Describer, status v1.BackupStatus) { | |||
|
|||
d.Println() | |||
d.Printf("Expiration:\t%s\n", status.Expiration.Time) | |||
d.Println() | |||
// "<n/a>" output should only be applicable for backups that failed validation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would put started & completed above expiration
The buckets are definitely 1-10 minutes. https://github.com/prometheus/client_golang/blob/master/prometheus/histogram.go#L54 shows the default values, and states that the values are based on seconds. That said I'm not confident that this is a good distribution. It may take some testing to tune these. |
My mistake - 1-10 minutes is right. Thanks for the correction! |
pkg/metrics/metrics.go
Outdated
7 * secondsInMinute, | ||
8 * secondsInMinute, | ||
9 * secondsInMinute, | ||
10 * secondsInMinute, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will mean any backup longer than 10 minutes is counted as "infinite". Is that reasonable?
It might be useful to use exponential buckets to get more resolution at the higher end of this range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also keep this distribution but cleanup the code a little using LinearBuckets
like:
Buckets: prometheus.LinearBuckets(0.0, secondsInMinute, 10),
The buckets can be tricky to define since they are typically specific to an environment. I'd recommend sane defaults to start, then allow users to define those buckets if they need to outside the defaults. |
@stevesloka How would you recommend letting users define them? For what it's worth, we're considering getting rid of the Ark |
Anything in config that doesn't move to a backup target will most likely become a flag to |
I'm not sure specifically in Ark, but should be something on the server since you can only configure it once. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not much to add, just one question
@@ -392,6 +393,11 @@ func (controller *backupController) runBackup(backup *api.Backup, bucket string) | |||
backupScheduleName := backup.GetLabels()["ark-schedule"] | |||
controller.metrics.SetBackupTarballSizeBytesGauge(backupScheduleName, backupSizeBytes) | |||
|
|||
backup.Status.CompletionTimestamp.Time = controller.clock.Now() | |||
backupDuration := backup.Status.CompletionTimestamp.Time.Sub(backup.Status.StartTimestamp.Time) | |||
backupDurationSeconds := float64(backupDuration / time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why's the / time.Second
needed? (Not saying it's wrong, just wondering)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch - that should be multiplied to get the seconds, since Duration is nanoseconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I'm confusing myself here - this is correct. We need to divide to get the seconds out of a Duration.
Multiplication would be seconds to nanoseconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To elaborate a bit more: the histogram we're building uses seconds as it's unit, which is why we're going from ns to seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it, ok. duration math is weird :)
pkg/metrics/metrics.go
Outdated
Name: backupDurationSeconds, | ||
Help: "Time taken to complete backup, in seconds", | ||
// Use 10 exponential buckets starting at 30s | ||
Buckets: prometheus.ExponentialBuckets(30.0, 3, 10), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would create the following buckets, correct?
- 30s
- 1m30s
- 4m30s
- 13m30s
- 40m30s
- 2h1m30s
- 6h4m30s
- 18h13m30s
- 54h40m30s
- 164h1m30s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah - I think 9 & 10 might be too high, perhaps even 8.
I think Linear buckets of 1 minute may leave out the high end if we only have 10, but I don't know if there's any sort of penalty for making many buckets that go unused.
pkg/util/test/comparisons.go
Outdated
err := equality.Semantic.AddFunc(TimesAreEqual) | ||
if err != nil { | ||
// Programmer error, the service should die. | ||
panic(errors.Wrap(err, "Could not register equality function")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think t.Fatal*
would be more appropriate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'll fix that.
Signed-off-by: Nolan Brubaker <nolan@heptio.com>
LGTM |
Signed-off-by: Nolan Brubaker nolan@heptio.com