
fix(metrics): Reverts deletion of SpectatorMeterRegistry and enables opt in of breaking change to prometheus metrics/dashboards through the monitoring daemon. #725

Closed
Conversation

fieldju
Contributor

@fieldju fieldju commented Jul 14, 2020

In PR #592 the default Spectator registry was changed to a Spectator Micrometer registry.
Having the Micrometer registry is nice because it enables things such as dropping Micrometer registries on the classpath and having them just work, or creating plugins that configure Micrometer, such as the Armory Observability Plugin.

However: the Micrometer registry adds a statistic tag to all metrics.

The monitoring daemon appends __value to the name of any metric that contains the statistic tag.

These two things in combination mean that 1.20 introduced a BREAKING change that broke existing Prometheus dashboards and alerting when using the monitoring daemon, since it changed the names of almost all metrics being reported to Prometheus, and potentially to other metric integrations.
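To make the interaction concrete, here is a minimal Python sketch of the renaming described above. This is illustrative only: the metric name, tag values, and the dot-to-underscore step are assumptions for the example, not the daemon's actual implementation.

```python
def daemon_prometheus_name(name: str, tags: dict) -> str:
    """Hypothetical sketch of the daemon's renaming rule: dots become
    underscores for Prometheus, and any metric carrying a `statistic`
    tag gets "__value" appended to its name."""
    prom_name = name.replace(".", "_")
    if "statistic" in tags:
        prom_name += "__value"
    return prom_name

# Before #592: the default Spectator registry adds no `statistic` tag.
before = daemon_prometheus_name("controller.invocations", {"method": "GET"})

# After #592: the Micrometer-backed registry tags every measurement with
# `statistic` (e.g. count, totalTime), so the exported name changes.
after = daemon_prometheus_name(
    "controller.invocations", {"method": "GET", "statistic": "count"}
)

print(before)  # controller_invocations
print(after)   # controller_invocations__value
```

Every dashboard query or alert keyed on the old name silently stops matching once the suffix appears.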

This PR partially reverts #592 and makes the Micrometer registry opt-in via the Spring Boot property spectator.micrometerRegistry.enabled: true.

I also tried to get fancy and have Spring conditionally configure the Micrometer registry vs. the default registry only when custom Micrometer registries were supplied, but without any luck, so I had to add the explicit feature flag spectator.micrometerRegistry.enabled.

cc: @Aloren

@fieldju fieldju changed the title fix(metrics): Revert deletion of SpectatorMeterRegistry and enabled opt in of breaking change to prometheus metrics/dashboards through the monitoring daemon. fix(metrics): Reverts deletion of SpectatorMeterRegistry and enables opt in of breaking change to prometheus metrics/dashboards through the monitoring daemon. Jul 14, 2020
Contributor

@ethanfrogers ethanfrogers left a comment


This looks fine to me but I'm not familiar with this area of work so I'll defer to Cam or Chris for the final say. AFAIK, we don't have a path for backporting Kork changes to patch releases so I'm not sure what we'll do to alleviate the actual regression in 1.20.

@plumpy @ezimanyi @maggieneterval - any thoughts here?

@fieldju
Contributor Author

fieldju commented Jul 14, 2020

This looks fine to me but I'm not familiar with this area of work so I'll defer to Cam or Chris for the final say. AFAIK, we don't have a path for backporting Kork changes to patch releases so I'm not sure what we'll do to alleviate the actual regression in 1.20.

@plumpy @ezimanyi @maggieneterval - any thoughts here?

@ethanfrogers it seems like PRs that are being merged right now are getting the target 1.20 label.

[screenshot: recently merged PRs receiving the target 1.20 label]

@ezimanyi
Contributor

I think the fact that PRs are being tagged with 1.20 is a bug in spinnakerbot (likely because we have a release-1.19.x branch in kork but not a release-1.20.x).

Ultimately, kork itself doesn't ship as a top-level service, so the release that includes a given change will depend on when kork was bumped to include that change in each microservice.

I'll leave it to the release manager to decide whether a cherry-pick of this would follow our cherry-pick policy, but if it does the rough steps are:

  • Make a release-1.20.x branch in kork that starts with the release of kork currently in the 1.20 version of the services
  • Cherry-pick the fix onto that release-1.20.x branch in kork
  • Make a release of kork from the release-1.20.x branch. I'm not sure what to version the release as to avoid conflicting with regular releases from master.
  • Cherry pick a kork bump to that new release into the release-1.20.x branch of each microservice.

I know that @jonsie and @link108 recently went through this process for some plugin work (maybe ~2 months ago) so they probably have the most recent experience with how to do the above.

@fieldju
Contributor Author

fieldju commented Jul 14, 2020

@ezimanyi does Bintray / Maven let you do major.minor.patch.someotherthing? If not, we could do major.minor.patch-hotfix.1 or something like that.

EDIT:

Scrap that; I guess in theory you would just bump the patch and have the 1.20 branches use the latest patch on whatever minor release they are on.

@fieldju
Contributor Author

fieldju commented Jul 14, 2020

Does it sound reasonable that we just leave it broken in 1.20, fix it for 1.21, and communicate that the fix is to not use 1.20?

@ezimanyi, et al?

@ezimanyi
Contributor

I think in the past we might have picked a kork version like 7.40.1-patch1 to distinguish, though I'm not sure if that caused any issues with our parsing logic.

I'll defer to the release manager, but in general the idea of saying that 1.20 is completely broken doesn't sound great to me (unless this really only affects a tiny subset of users). If this was a regression and we can fix it safely, we probably should, even if it's more work than a normal fix.

Contributor

@jonsie jonsie left a comment


Would it make more sense to address this in the spinnaker-monitoring daemon? This seems like a heavy handed approach for what is essentially an integration issue with a specific tool. I'd rather keep Kork simple and not introduce another feature flag for all the Spinnaker services, and instead put a change in spinnaker-monitoring that optionally skips appending _value to metrics with statistic on them. Would that make sense? I'm not really sure what _value in spinnaker-monitoring is supposed to accomplish.

@cfieber
Contributor

cfieber commented Jul 14, 2020

+1 to @jonsie - at this point there isn't clear ownership of spinnaker-monitoring and it is basically abandonware. If we feel it is a core offering, someone needs to step up and keep it relevant.

@fieldju
Contributor Author

fieldju commented Jul 14, 2020

Would it make more sense to address this in the spinnaker-monitoring daemon? This seems like a heavy handed approach for what is essentially an integration issue with a specific tool. I'd rather keep Kork simple and not introduce another feature flag for all the Spinnaker services, and instead put a change in spinnaker-monitoring that optionally skips appending _value to metrics with statistic on them. Would that make sense? I'm not really sure what _value in spinnaker-monitoring is supposed to accomplish.

@jonsie re: ☝️ You can't really fix this in the monitoring daemon, because you don't know which metrics had the statistic tag before and which ones newly gained it. If this is the approach we want, I would say just update the dashboards, as that would probably be easier.

I personally am not a fan of the monitoring daemon and the extra layer of indirection it adds, and would prefer the plugin approach I have been working on (using Micrometer's integration with Spring Boot).

We are pushing the observability plugin, which will have its own set of dashboards but eliminates the need for the daemon to run on each pod.

If I was going to take time to fix/create Prometheus dashboards, it would probably be for the plugin I have been working on, which has its own Prometheus metrics naming convention, and not for the daemon.

Let me talk internally about a couple options.

@fieldju
Contributor Author

fieldju commented Jul 14, 2020

+1 to @jonsie - at this point there isn't clear ownership of spinnaker-monitoring and it is basically abandonware. If we feel it is a core offering, someone needs to step up and keep it relevant.

@cfieber I do not want to support it; I have created a plugin that basically replaces it and will be supporting that model.

@fieldju
Contributor Author

fieldju commented Jul 15, 2020

Would it make more sense to address this in the spinnaker-monitoring daemon?

Yes, IFF that means updating the dashboards and optionally changing the naming convention.
Some metrics already had the statistic tag, so if you remove the _value logic altogether, then you break those charts/widgets.

I'd rather keep Kork simple and not introduce another feature flag for all the Spinnaker services

I'll +1 this.

I'm not really sure what _value in spinnaker-monitoring is supposed to accomplish.

It's the OP's attempt at following the Prometheus naming guidelines: https://prometheus.io/docs/practices/naming/#metric-names
When using the official Micrometer Prometheus integration, such as through our plugin, this is done here: https://github.com/micrometer-metrics/micrometer/blob/master/implementations/micrometer-registry-prometheus/src/main/java/io/micrometer/prometheus/PrometheusNamingConvention.java

Although neither the naming convention nor the official Micrometer integration uses _value 🤷.
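For contrast, here is a rough Python sketch of the kind of suffixing the linked PrometheusNamingConvention performs; the meter types and suffix rules are simplified assumptions for illustration, so consult the linked Java source for the real behavior:

```python
import re

def prometheus_name(name, meter_type, base_unit=None):
    """Sketch of a Prometheus naming convention: sanitize the dotted
    name, then append a unit or type suffix per the Prometheus naming
    guidelines (counters end in _total, timers report in _seconds)."""
    n = re.sub(r"[^a-zA-Z0-9_:]", "_", name)
    if meter_type == "counter" and not n.endswith("_total"):
        n += "_total"            # counters get a _total suffix
    elif base_unit and not n.endswith("_" + base_unit):
        n += "_" + base_unit     # e.g. timers get _seconds
    return n

print(prometheus_name("controller.invocations", "timer", "seconds"))
# controller_invocations_seconds
print(prometheus_name("http.requests", "counter"))
# http_requests_total
```

Note that the suffix encodes the unit or meter type, not a generic _value, which is why the daemon's scheme surprises people coming from the official integration.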

Personally, I think the Prometheus naming convention, or rather the act of renaming metrics, adds unneeded developer friction.
It's really confusing when metrics show up in the metrics store under a different name and you have to do the mental mapping.

The current immediate plan is for @jasonmcintosh to update the dashboards by appending _value where it needs to go.

Our long-term plan is to create a new set of dashboards that are compatible with our plugin and declare that those are the only ones we support.

@fieldju fieldju closed this Jul 15, 2020
@jasonmcintosh
Member

Just an FYI - I also hit some interesting scaling differences: apparently Micrometer uses seconds where Spectator was using nanoseconds (this is on totalTime vs. total), so the conversion isn't super "simple". Also, with the removal of Hystrix, a lot of dashboards just need updating, period. That's a more interesting challenge. I will try to get updated dashboards out, but I don't guarantee they'll be "great" without more peer review of them...
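The scale difference can be sketched as a one-line conversion. This shows the arithmetic only; the actual dashboard fix involves editing the PromQL queries rather than converting values in code.

```python
# Spectator reported timer totals in nanoseconds, while Micrometer
# reports totalTime in seconds, so old dashboard queries are off by
# a constant factor of 1e9.

NANOS_PER_SECOND = 1_000_000_000

def micrometer_seconds_to_spectator_nanos(seconds):
    """Convert a Micrometer totalTime (seconds) to the nanosecond
    scale the old Spectator-based dashboards expected."""
    return seconds * NANOS_PER_SECOND

# A 250 ms request: 0.25 s in Micrometer vs 2.5e8 ns in Spectator.
print(micrometer_seconds_to_spectator_nanos(0.25))  # 250000000.0
```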

@plumpy
Member

plumpy commented Jul 15, 2020

@fieldju Is your plugin open-source or is it Armory-only?

@jasonmcintosh
Member

Open Source :)

@jasonmcintosh
Member

FYI, I've got some updated dashboards here: https://github.com/armory-io/spinnaker-monitoring/tree/UpdatedDashboards/spinnaker-monitoring-third-party/third_party/prometheus I also have an updated Dockerfile that fixes some vulnerabilities, just in case others find this thread. Note: these are raw JSON exports, so they don't have the "go replace the datasource" kind of stuff that the scripts added. I figure anyone who wants that can do it other ways.
