Monitoring the Upjet runtime

The Kubernetes controller-runtime library provides a Prometheus metrics endpoint by default. The Upjet-based providers, including upbound/provider-aws, upbound/provider-azure, upbound/provider-azuread and upbound/provider-gcp, expose various metrics from the controller-runtime to help monitor the health of its runtime components, such as the controller-runtime client, the leader election client and the controller workqueues. Each controller also exposes metrics related to the reconciliation of the custom resources and to the active reconciliation worker goroutines.

In addition to the metrics exposed by the controller-runtime, the Upjet-based providers also expose metrics specific to the Upjet runtime. The Upjet runtime registers some custom metrics using the available extension mechanism, and these metrics are available from the default /metrics endpoint of the provider pod. The custom metrics exposed by the Upjet runtime are:

  • upjet_terraform_cli_duration: A histogram metric that reports, in seconds, how long a Terraform CLI invocation takes to complete.
  • upjet_terraform_active_cli_invocations: A gauge metric that reports the number of active (running) Terraform CLI invocations.
  • upjet_terraform_running_processes: A gauge metric that reports the number of running Terraform CLI and Terraform provider processes.
  • upjet_resource_ttr: A histogram metric that measures, in seconds, the time-to-readiness for managed resources.
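
To make the shape of these metrics concrete, here is a minimal Python sketch that picks the upjet_ samples out of Prometheus text exposition output, such as the body returned by the provider pod's /metrics endpoint. The sample text and its values are made up for illustration and were not captured from a real provider, and the parser handles only the simple "name{labels} value" form rather than the full exposition format.

```python
import re

# Illustrative exposition text. The metric and label names come from this
# document; the values (and the workqueue "name" label value) are made up.
SAMPLE = """\
# TYPE upjet_terraform_active_cli_invocations gauge
upjet_terraform_active_cli_invocations{mode="sync",subcommand="init"} 2
upjet_terraform_active_cli_invocations{mode="async",subcommand="apply"} 3
# TYPE upjet_terraform_running_processes gauge
upjet_terraform_running_processes{type="cli"} 5
upjet_terraform_running_processes{type="provider"} 4
workqueue_depth{name="example"} 0
"""

# name, optional {label="value",...} block, then the sample value.
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text, prefix="upjet_"):
    """Return (name, labels, value) tuples for samples whose name starts with prefix."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None or not match.group(1).startswith(prefix):
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_samples(SAMPLE):
        print(name, labels, value)
```

The histogram metrics (upjet_terraform_cli_duration and upjet_resource_ttr) appear in the same output as several series each, with the standard _bucket, _sum and _count suffixes, so the same parser applies to them.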

Prometheus metrics can carry labels that distinguish the characteristics of the measurements being made. For example, when counting the number of active Terraform processes, a label distinguishes the CLI processes from the Terraform provider processes. The labels associated with each of the custom Upjet metrics above are:

  • Labels associated with the upjet_terraform_cli_duration metric:
    • subcommand: The terraform subcommand that's run, e.g., init, apply, plan or destroy.
    • mode: The execution mode of the Terraform CLI, either sync (the CLI was invoked synchronously as part of a reconcile loop) or async (the CLI was invoked asynchronously, and the reconciler goroutine will poll and collect the results later).
  • Labels associated with the upjet_terraform_active_cli_invocations metric:
    • subcommand: The terraform subcommand that's run, e.g., init, apply, plan or destroy.
    • mode: The execution mode of the Terraform CLI, either sync (the CLI was invoked synchronously as part of a reconcile loop) or async (the CLI was invoked asynchronously, and the reconciler goroutine will poll and collect the results later).
  • Labels associated with the upjet_terraform_running_processes metric:
    • type: Either cli for Terraform CLI processes (the terraform process) or provider for Terraform provider processes. Please note that this is a best-effort metric that may not precisely catch and report all relevant processes. We may improve this in the future if needed, for example by watching the fork system calls. For now, it can be useful for watching for rogue Terraform provider processes.
  • Labels associated with the upjet_resource_ttr metric:
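
As a rough illustration of how these labels are used, the Python sketch below sums gauge samples grouped by a single label, which is the same idea as aggregating by a label in a Prometheus query. The sample values and the pairing of subcommands with modes are hypothetical, and sum_by is a helper written for this example, not part of Upjet or of any Prometheus client library.

```python
from collections import defaultdict

# Hypothetical samples as (metric name, labels, value). The metric and label
# names come from this document; the values and combinations are made up.
SAMPLES = [
    ("upjet_terraform_active_cli_invocations", {"subcommand": "init", "mode": "sync"}, 2.0),
    ("upjet_terraform_active_cli_invocations", {"subcommand": "plan", "mode": "sync"}, 1.0),
    ("upjet_terraform_active_cli_invocations", {"subcommand": "apply", "mode": "async"}, 3.0),
    ("upjet_terraform_active_cli_invocations", {"subcommand": "destroy", "mode": "async"}, 1.0),
    ("upjet_terraform_running_processes", {"type": "cli"}, 7.0),
    ("upjet_terraform_running_processes", {"type": "provider"}, 4.0),
]


def sum_by(samples, metric, label):
    """Sum the values of one metric, grouped by the value of one label."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        if name == metric and label in labels:
            totals[labels[label]] += value
    return dict(totals)


if __name__ == "__main__":
    # Active CLI invocations split by execution mode.
    print(sum_by(SAMPLES, "upjet_terraform_active_cli_invocations", "mode"))
    # Running processes split into CLI and provider processes.
    print(sum_by(SAMPLES, "upjet_terraform_running_processes", "type"))
```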

Examples

You can export all these custom metrics and the controller-runtime metrics from the provider pod for Prometheus. Here are some examples showing the custom metrics in action from the Prometheus console:

  • upjet_terraform_active_cli_invocations gauge metric showing the sync and async terraform init/apply/plan/destroy invocations: [image]

  • upjet_terraform_running_processes gauge metric showing both cli and provider labels: [image]

  • upjet_terraform_cli_duration histogram metric, showing average Terraform CLI running times for the last 5m: [image]

  • The medians (0.5-quantiles) for these observations, aggregated by the mode and the Terraform subcommand being invoked: [image]

  • upjet_resource_ttr histogram metric, showing average resource TTR for the last 10m: [image]

  • The median (0.5-quantile) for these TTR observations: [image]

These samples were collected by provisioning 10 upbound/provider-aws cognitoidp.UserPool resources with the provider running at a poll interval of 1m. In these examples, one can observe that the resources were polled (reconciled) twice after they acquired the Ready=True condition, and were then destroyed.
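
The averages and medians in the examples above are derived from each histogram's cumulative buckets and its sum and count series. The following Python sketch shows that arithmetic under stated assumptions: the bucket boundaries and counts are made up for illustration and are not the boundaries Upjet actually registers, and the linear interpolation follows the general approach of PromQL's histogram_quantile function without reproducing every edge case.

```python
import math

# Hypothetical cumulative buckets: (upper bound in seconds, cumulative count).
# The last bucket is the mandatory +Inf bucket of a Prometheus histogram.
BUCKETS = [(1.0, 2), (5.0, 6), (10.0, 10), (math.inf, 10)]


def average(sum_increase, count_increase):
    """Mean observation over a window: increase of _sum divided by increase of _count."""
    if count_increase == 0:
        return math.nan
    return sum_increase / count_increase


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets sorted by upper bound.

    Observations are assumed to be spread evenly inside each bucket, so the
    result is an estimate, not an exact quantile.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            in_bucket = cumulative - lower_count
            # No finite upper edge (or an empty bucket) to interpolate towards.
            if math.isinf(upper_bound) or in_bucket == 0:
                return lower_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound


if __name__ == "__main__":
    # 10 observations totalling 42 seconds within the window (made-up numbers).
    print(average(42.0, 10))
    print(estimate_quantile(0.5, BUCKETS))
```

With these made-up buckets, the median falls in the bucket between 1 and 5 seconds, so the estimate depends on how wide that bucket is. Narrower buckets around the durations of interest give tighter estimates.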

Reference

You can find a full reference of the exposed metrics from the Upjet-based providers here.