
fix(metrics): Reverts deletion of SpectatorMeterRegistry and enables opt in of breaking change to prometheus metrics/dashboards through the monitoring daemon. #725

Closed
Conversation

fieldju
Contributor

@fieldju fieldju commented Jul 14, 2020

In PR #592 the default Spectator registry was changed to a Spectator Micrometer registry.
Having the Micrometer registry is nice because it enables things such as dropping Micrometer registries on the classpath and having them just work, or creating plugins that configure Micrometer, such as the Armory Observability Plugin.

However: the Micrometer registry adds a statistic tag to all metrics.

The monitoring daemon appends __value to the name of any metric that contains the statistic tag.

These two things in combination mean that 1.20 introduced a BREAKING change that broke existing Prometheus dashboards and alerting when using the monitoring daemon, since it changed the names of almost all metrics being reported to Prometheus, and potentially to other metric integrations.
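To make the interaction concrete, here is a minimal Python sketch of the renaming described above. This is illustrative only: the metric name, tag values, and the dot-to-underscore step are assumptions for the example, not the daemon's actual implementation.

```python
def daemon_prometheus_name(name: str, tags: dict) -> str:
    """Hypothetical sketch of the daemon's renaming rule: dots become
    underscores for Prometheus, and any metric carrying a `statistic`
    tag gets "__value" appended to its name."""
    prom_name = name.replace(".", "_")
    if "statistic" in tags:
        prom_name += "__value"
    return prom_name

# Before #592: the default Spectator registry adds no `statistic` tag.
before = daemon_prometheus_name("controller.invocations", {"method": "GET"})

# After #592: the Micrometer-backed registry tags every measurement with
# `statistic` (e.g. count, totalTime), so the exported name changes.
after = daemon_prometheus_name(
    "controller.invocations", {"method": "GET", "statistic": "count"}
)

print(before)  # controller_invocations
print(after)   # controller_invocations__value
```

Every dashboard query or alert keyed on the old name silently stops matching once the suffix appears.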

This PR partially reverts #592 and makes the Micrometer registry opt-in via the Spring Boot property spectator.micrometerRegistry.enabled: true.

I also tried to get fancy and have Spring conditionally configure the Micrometer registry vs. the default registry only when custom Micrometer registries were supplied, but without any luck, so I had to add the explicit feature flag spectator.micrometerRegistry.enabled.

cc: @Aloren

@fieldju fieldju changed the title fix(metrics): Revert deletion of SpectatorMeterRegistry and enabled opt in of breaking change to prometheus metrics/dashboards through the monitoring daemon. fix(metrics): Reverts deletion of SpectatorMeterRegistry and enables opt in of breaking change to prometheus metrics/dashboards through the monitoring daemon. Jul 14, 2020
Contributor

@ethanfrogers ethanfrogers left a comment


This looks fine to me but I'm not familiar with this area of work so I'll defer to Cam or Chris for the final say. AFAIK, we don't have a path for backporting Kork changes to patch releases so I'm not sure what we'll do to alleviate the actual regression in 1.20.

@plumpy @ezimanyi @maggieneterval - any thoughts here?

@fieldju
Contributor Author

fieldju commented Jul 14, 2020

This looks fine to me but I'm not familiar with this area of work so I'll defer to Cam or Chris for the final say. AFAIK, we don't have a path for backporting Kork changes to patch releases so I'm not sure what we'll do to alleviate the actual regression in 1.20.

@plumpy @ezimanyi @maggieneterval - any thoughts here?

@ethanfrogers it seems like PRs that are being merged right now are getting the target 1.20 label.

[screenshot: recently merged PRs receiving the target 1.20 label]

@ezimanyi
Contributor

I think the fact that PRs are being tagged with 1.20 is a bug in spinnakerbot (likely because we have a release-1.19.x branch in kork but not a release-1.20.x).

Ultimately, kork itself doesn't ship as a top-level service, so the release that includes a given change will depend on when kork was bumped to include that change in each microservice.

I'll leave it to the release manager to decide whether a cherry-pick of this would follow our cherry-pick policy, but if it does the rough steps are:

  • Make a release-1.20.x branch in kork that starts with the release of kork currently in the 1.20 version of the services
  • Cherry-pick the fix onto that release-1.20.x branch in kork
  • Make a release of kork from the release-1.20.x branch. I'm not sure what to version the release as to avoid conflicting with regular releases from master.
  • Cherry pick a kork bump to that new release into the release-1.20.x branch of each microservice.

I know that @jonsie and @link108 recently went through this process for some plugin work (maybe ~2 months ago) so they probably have the most recent experience with how to do the above.

@fieldju
Contributor Author

fieldju commented Jul 14, 2020

@ezimanyi does Bintray / Maven let you do major.minor.patch.someotherthing? If not, we could do major.minor.patch-hotfix.1 or something like that.

EDIT:

Scrap that; I guess in theory you would just bump the patch and have the 1.20 branches use the latest patch on whatever minor release they are on.

@fieldju
Contributor Author

fieldju commented Jul 14, 2020

Does it sound reasonable that we just leave it broken in 1.20, fix it for 1.21, and communicate that the fix is to not use 1.20?

@ezimanyi, et al?

@ezimanyi
Contributor

I think in the past we might have picked a kork version like 7.40.1-patch1 to distinguish, though I'm not sure if that caused any issues with our parsing logic.

I'll defer to the release manager, but in general the idea of saying that 1.20 is completely broken doesn't sound great to me (unless this really only affects a tiny subset of users). If this was a regression and we can fix it safely, we probably should, even if it's more work than a normal fix.

Contributor

@jonsie jonsie left a comment


Would it make more sense to address this in the spinnaker-monitoring daemon? This seems like a heavy handed approach for what is essentially an integration issue with a specific tool. I'd rather keep Kork simple and not introduce another feature flag for all the Spinnaker services, and instead put a change in spinnaker-monitoring that optionally skips appending _value to metrics with statistic on them. Would that make sense? I'm not really sure what _value in spinnaker-monitoring is supposed to accomplish.

@cfieber
Contributor

cfieber commented Jul 14, 2020

+1 to @jonsie - at this point there isn't clear ownership of spinnaker-monitoring and it is basically abandonware. If we feel it is a core offering, someone needs to step up and keep it relevant.

@fieldju
Contributor Author

fieldju commented Jul 14, 2020

Would it make more sense to address this in the spinnaker-monitoring daemon? This seems like a heavy handed approach for what is essentially an integration issue with a specific tool. I'd rather keep Kork simple and not introduce another feature flag for all the Spinnaker services, and instead put a change in spinnaker-monitoring that optionally skips appending _value to metrics with statistic on them. Would that make sense? I'm not really sure what _value in spinnaker-monitoring is supposed to accomplish.

@jonsie re: ☝️ You can't really fix this in the monitoring daemon, because you don't know which metrics had the statistic tag before and which ones newly gained it. If this is the approach we want, I would say just update the dashboards, as that would probably be easier.

I personally am not a fan of the monitoring daemon and the extra layer of indirection it adds, and would prefer the plugin approach I have been working on (using Micrometer's integration with Spring Boot).

We are pushing the observability plugin, which will have its own set of dashboards but eliminates the need for the daemon to run on each pod.

If I was going to take time to fix/create Prometheus dashboards, it would probably be for the plugin I have been working on, which has its own Prometheus metrics naming convention, and not for the daemon.

Let me talk internally about a couple options.

@fieldju
Contributor Author

fieldju commented Jul 14, 2020

+1 to @jonsie - at this point there isn't clear ownership of spinnaker-monitoring and it is basically abandonware. If we feel it is a core offering, someone needs to step up and keep it relevant.

@cfieber I do not want to support it; I have created a plugin that basically replaces it and will be supporting that model.

@fieldju
Contributor Author

fieldju commented Jul 15, 2020

Would it make more sense to address this in the spinnaker-monitoring daemon?

Yes, IFF that means updating the dashboards and optionally changing the naming convention.
Some metrics already had the statistic tag, so if you remove the _value logic altogether, then you break those charts/widgets.

I'd rather keep Kork simple and not introduce another feature flag for all the Spinnaker services

I'll +1 this.

I'm not really sure what _value in spinnaker-monitoring is supposed to accomplish.

It's the OP's attempt at following the Prometheus naming guidelines: https://prometheus.io/docs/practices/naming/#metric-names
When using the official Micrometer Prometheus integration, such as through our plugin, this is done here: https://github.com/micrometer-metrics/micrometer/blob/master/implementations/micrometer-registry-prometheus/src/main/java/io/micrometer/prometheus/PrometheusNamingConvention.java

Although neither the naming convention nor the official Micrometer integration uses _value 🤷.
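For contrast, here is a rough Python sketch of the kind of suffixing the linked PrometheusNamingConvention performs; the meter types and suffix rules are simplified assumptions for illustration, so consult the linked Java source for the real behavior:

```python
import re

def prometheus_name(name, meter_type, base_unit=None):
    """Sketch of a Prometheus naming convention: sanitize the dotted
    name, then append a unit or type suffix per the Prometheus naming
    guidelines (counters end in _total, timers report in _seconds)."""
    n = re.sub(r"[^a-zA-Z0-9_:]", "_", name)
    if meter_type == "counter" and not n.endswith("_total"):
        n += "_total"            # counters get a _total suffix
    elif base_unit and not n.endswith("_" + base_unit):
        n += "_" + base_unit     # e.g. timers get _seconds
    return n

print(prometheus_name("controller.invocations", "timer", "seconds"))
# controller_invocations_seconds
print(prometheus_name("http.requests", "counter"))
# http_requests_total
```

Note that the suffix encodes the unit or meter type, not a generic _value, which is why the daemon's scheme surprises people coming from the official integration.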

Personally, I think the Prometheus naming convention, or rather the act of renaming metrics, adds unneeded developer friction.
It's really confusing when metrics show up in the metrics store under a different name and you have to do the mental mapping.

The current immediate plan is for @jasonmcintosh to update the dashboards by appending _value where it needs to go.

Our long-term plan is to create a new set of dashboards that are compatible with our plugin and declare that those are the only ones we support.

@fieldju fieldju closed this Jul 15, 2020
@jasonmcintosh
Member

Just an FYI - I also hit some interesting scaling differences: apparently Micrometer uses seconds where Spectator was using nanoseconds (this is on totalTime vs. total), so the conversion isn't super "simple". Also, with the removal of Hystrix, a lot of dashboards just need updating, period. That's a more interesting challenge. I will try to get updated dashboards out, but I don't guarantee they'll be "great" without more peer review of them...
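The scale difference can be sketched as a one-line conversion. This shows the arithmetic only; the actual dashboard fix involves editing the PromQL queries rather than converting values in code.

```python
# Spectator reported timer totals in nanoseconds, while Micrometer
# reports totalTime in seconds, so old dashboard queries are off by
# a constant factor of 1e9.

NANOS_PER_SECOND = 1_000_000_000

def micrometer_seconds_to_spectator_nanos(seconds):
    """Convert a Micrometer totalTime (seconds) to the nanosecond
    scale the old Spectator-based dashboards expected."""
    return seconds * NANOS_PER_SECOND

# A 250 ms request: 0.25 s in Micrometer vs 2.5e8 ns in Spectator.
print(micrometer_seconds_to_spectator_nanos(0.25))  # 250000000.0
```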

@plumpy
Member

plumpy commented Jul 15, 2020

@fieldju Is your plugin open-source or is it Armory-only?

@jasonmcintosh
Member

Open Source :)

@jasonmcintosh
Member

FYI, I've got some updated dashboards here: https://github.com/armory-io/spinnaker-monitoring/tree/UpdatedDashboards/spinnaker-monitoring-third-party/third_party/prometheus I also have an updated Dockerfile that fixes some vulnerabilities, just in case others find this thread. Note: these are raw JSON exports, so they don't have the "go replace the datasource" kind of stuff that the scripts added. I figure anyone who wants that can do it other ways.
