Update prometheus helmfiles and rules to better use metric labels #304

willgraf · 2020-03-27T17:15:36Z

To improve autoscaling, each Prometheus-related chart has been updated to the latest version. While updating the charts, I found that some of our values had become stale (especially the image version), so I removed every value that we do not need or want to set ourselves. Those settings now inherit from the chart's default values, which are linked in the helmfile.

Updated prometheus-redis-exporter to 3.3.3
Updated prometheus-operator to 8.12.3
Updated prometheus-adapter to 2.1.3

Additionally, Prometheus uses labels to match time series when doing math on two or more of them. To get our labels to match up, I made a few changes:

Update the redis-exporter to have the key name be queue and queue-zip instead of queue_image_keys and queue_zip_keys.
Add a metric_relabel_config to the redis-exporter prometheus job to take the queue name and include a new label, deployment="queue-consumer".
Refactor zip-consumer to segmentation-zip-consumer in order to have the labels match the queue name.
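As a rough illustration of the relabeling step, here is a minimal Python sketch of what a Prometheus `replace` relabel action does to one label set. It is not the configuration from this PR: the `relabel` helper, the `redis_key_size` metric name, and the `key` source label are assumptions made for the example.

```python
import re


def relabel(labels, source_label, regex, target_label, replacement):
    """Mimic a Prometheus 'replace' relabel action on a single label set."""
    value = labels.get(source_label, "")
    # Prometheus anchors relabel regexes, so the whole value must match.
    match = re.fullmatch(regex, value)
    if match is None:
        return dict(labels)  # no match: the series is left unchanged
    updated = dict(labels)
    # Write the expanded replacement (capture groups allowed) to the target label.
    updated[target_label] = match.expand(replacement)
    return updated


# Hypothetical series exported for the Redis key named "queue".
series = {"__name__": "redis_key_size", "key": "queue"}
relabeled = relabel(series, "key", r"(.+)", "deployment", r"\1-consumer")
print(relabeled["deployment"])
```

With the same rule, a key named `segmentation-zip` would get `deployment="segmentation-zip-consumer"`, which is why the consumer deployment has to carry that name for the labels to line up.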

Also, I updated our rules:

Added new rules consumer_key_ratio and consumers_per_gpu which use the new labels to calculate stats for all deployed consumers (if their name X-consumer matches the queue X).
Changed the GPU metrics to a single tf_serving_gpu_usage metric.
Removed the avg_over_time calls because they were outputting discrete points instead of a nice line.
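To show why the labels have to match, here is a small Python sketch of label-matched division, similar in spirit to a PromQL expression of the form `a / on(deployment) b`. The actual rule expressions are not shown in this PR, so the `divide_on` helper, the sample values, and the choice of dividing queue keys by consumer replicas are all assumptions for illustration.

```python
def divide_on(left, right, label):
    """Divide samples whose `label` values are equal; unmatched series are dropped."""
    right_by_label = {
        labels[label]: value for labels, value in right if label in labels
    }
    result = {}
    for labels, value in left:
        key = labels.get(label)
        divisor = right_by_label.get(key)
        # PromQL would return +Inf or NaN for a zero divisor;
        # this sketch simply skips the sample instead.
        if divisor:
            result[key] = value / divisor
    return result


# Hypothetical samples: keys waiting in each queue, after relabeling.
queue_keys = [
    ({"deployment": "queue-consumer"}, 120.0),
    ({"deployment": "segmentation-zip-consumer"}, 8.0),
]
# Hypothetical samples: replicas currently running per deployment.
replicas = [
    ({"deployment": "queue-consumer"}, 4.0),
    ({"deployment": "tf-serving"}, 2.0),
]

ratio = divide_on(queue_keys, replicas, "deployment")
print(ratio)
```

Only `queue-consumer` appears on both sides, so it is the only deployment with a result; a series whose `deployment` label has no counterpart produces no output at all.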

Using .75 for the backoff coefficient for now, as it anecdotally seems to work. The ideal number could be more or less.

MekWarrior

Looks good, well done!

* Update prometheus-operator, prometheus-adapter, and prometheus-redis-exporter helm charts and remove stale default values
* Relabel redis-exporter with `deployment=$QUEUE-consumer` and change key to be `$QUEUE`
* Rename zip-consumer to segmentation-zip-consumer to match labels.
* Using .75 instead of .9 for backoff coefficient.

willgraf added 9 commits March 26, 2020 19:04

update prometheus-operator helm chart and remove stale default values

e23392c

update prometheus-adapter helm chart and remove stale default values

00aacce

update prometheus rules and relabel the redis-exporter metrics

1b9b2a6

rename key for redis-exporter as the queue name

2405262

rename zip-consumer to segmentation-zip-consumer to match labels.

46eee81

update prometheus-redis-exporter to 3.3.3

6ed077e

remove deprecated HPA

234b16c

invalid intervals!

167ec19

add scrape_interval for tensorflow back in

e1cd346

willgraf requested a review from MekWarrior March 27, 2020 17:15

willgraf added 2 commits March 27, 2020 10:39

delete the CRD thanosrulers.monitoring.coreos.com too

7e0e66d

backoff coefficient must be < .9 for scaledown.

4f16dd1

using .75 for now as anecdotally seems to work. Ideal number could be more or less.

MekWarrior approved these changes Mar 28, 2020

willgraf merged commit 465f869 into stable Mar 28, 2020

willgraf deleted the willgraf/prometheus-update branch March 28, 2020 22:23

willgraf mentioned this pull request Apr 3, 2020

Optimizing Tf-Serving/Redis-Consumer Interaction #286

Closed
