Expose metrics from razee controllers #197

Open
gregswift opened this issue Feb 8, 2021 · 1 comment
Labels
enhancement New feature or request

Comments

@gregswift
Contributor

Is your feature request related to a problem? Please describe.
As we settle into running Razee in production, we have found a few things that would be nice to monitor and alert on.

We are trying to answer questions like:

  • How many locked resources are in the cluster?
  • Are runs completing successfully?
  • When was the last time a controller completed a run?
  • How long are runs taking?
  • How long is each resource taking? (this is more for future exploration and enhancements)

Describe the solution you'd like
It would be nice to have each controller expose an OpenMetrics-compatible set of metrics that Prometheus, Sysdig, or other OpenMetrics agents could easily scrape. The types of metrics we think might help answer the above questions include:

  • Number of resources from last run, with a breakdown by
    • success
    • failed
    • skipped due to debug flag
  • A heat map of the time taken to process each resource
  • Boolean cluster lock state
  • Last run completion time
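
To make the list above concrete, here is a minimal, dependency-free sketch of how these metrics might look in the Prometheus text exposition format (which OpenMetrics builds on). Everything in it is an assumption for illustration: the metric names (`razee_run_resources`, `razee_resource_process_duration_seconds`, `razee_cluster_locked`, `razee_last_run_completion_timestamp_seconds`), the `RunSummary` shape, and the bucket boundaries are hypothetical and not part of Razee. The heat map is modelled as a histogram, since heat maps are typically drawn from histogram buckets.

```typescript
// Hypothetical shape of one controller run; not an existing Razee type.
interface RunSummary {
  controller: string;
  success: number;
  failed: number;
  skipped: number; // skipped due to debug flag
  locked: boolean; // cluster lock state
  completedAtSeconds: number; // Unix time of the last completed run
  resourceDurationsSeconds: number[]; // time spent on each resource
}

// Assumed bucket upper bounds in seconds; tune to real run times.
const BUCKETS: number[] = [0.1, 0.5, 1, 5];

// Renders one run as Prometheus text exposition lines.
// Label values are assumed to need no escaping.
function renderMetrics(run: RunSummary): string {
  const c = `controller="${run.controller}"`;
  const lines: string[] = [];

  // Resource counts from the last run, broken down by status.
  lines.push("# TYPE razee_run_resources gauge");
  const statuses: Array<[string, number]> = [
    ["success", run.success],
    ["failed", run.failed],
    ["skipped", run.skipped],
  ];
  for (const [status, count] of statuses) {
    lines.push(`razee_run_resources{${c},status="${status}"} ${count}`);
  }

  // Histogram of per-resource processing time (buckets are cumulative).
  const h = "razee_resource_process_duration_seconds";
  const total = run.resourceDurationsSeconds.length;
  const sum = run.resourceDurationsSeconds.reduce((a, b) => a + b, 0);
  lines.push(`# TYPE ${h} histogram`);
  for (const le of BUCKETS) {
    const n = run.resourceDurationsSeconds.filter((d) => d <= le).length;
    lines.push(`${h}_bucket{${c},le="${le}"} ${n}`);
  }
  lines.push(`${h}_bucket{${c},le="+Inf"} ${total}`);
  lines.push(`${h}_sum{${c}} ${sum}`);
  lines.push(`${h}_count{${c}} ${total}`);

  // Cluster lock state as 0/1.
  lines.push("# TYPE razee_cluster_locked gauge");
  lines.push(`razee_cluster_locked{${c}} ${run.locked ? 1 : 0}`);

  // Last run completion time as a Unix timestamp.
  lines.push("# TYPE razee_last_run_completion_timestamp_seconds gauge");
  lines.push(
    `razee_last_run_completion_timestamp_seconds{${c}} ${run.completedAtSeconds}`
  );

  return lines.join("\n") + "\n";
}
```

A controller would serve this string from an HTTP endpoint such as `/metrics`. In practice a Prometheus client library would likely replace the hand-rolled rendering; the sketch is only meant to show the shape of the data.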

Describe alternatives you've considered
We've thought about trying to figure some of this out strictly from logs or by writing scripts to scrape the environment. We will probably do some of that, but it's less "native" and shareable.

@alewitt2 alewitt2 added the enhancement New feature or request label Feb 8, 2021
@esatterwhite
Contributor

It would also be interesting to see the number of env lookups each resource has to make.

  • counter with a resource type + resource name
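
As a rough sketch of what such a counter could look like, the snippet below keeps one count per resource type and resource name pair and renders it in the Prometheus text format. The class `LabelledCounter`, the metric name `razee_env_lookups_total`, and the label names are hypothetical, not part of Razee or any client library.

```typescript
// Hypothetical helper: a counter keyed by resource type and resource name.
class LabelledCounter {
  // Maps a rendered label set to its current count.
  private counts: Record<string, number> = {};

  constructor(private readonly name: string) {}

  // Record one env lookup for the given resource.
  // Label values are assumed to need no escaping.
  inc(resourceType: string, resourceName: string): void {
    const key = `resource_type="${resourceType}",resource_name="${resourceName}"`;
    this.counts[key] = (this.counts[key] || 0) + 1;
  }

  // Render all series in Prometheus text exposition format.
  render(): string {
    const lines: string[] = [`# TYPE ${this.name} counter`];
    for (const labels of Object.keys(this.counts)) {
      lines.push(`${this.name}{${labels}} ${this.counts[labels]}`);
    }
    return lines.join("\n") + "\n";
  }
}
```

Calling `inc("RemoteResource", "my-config")` on each lookup would yield one series per resource. Labelling by resource name can produce many series on large clusters, so cardinality is worth keeping an eye on.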

It would also be good to expose the Node.js metrics that you get from the standard Prometheus client:

  • heap usage
  • event loop lag
  • GC events by type
